ABSTRACT


In  this  dissertation  the  method  of  identity  coefficients  is  used 
to  study  several  aspects  of  the  genetic  structure  of  populations.  This 
method  involves  the  construction  of  recursion  relationships  for  the 
probabilities  that  a  specific  sample  of  gametes  are,  or  are  not, 
identical.

The  problems  considered  are  the  expected  amount  of  squared 
linkage  disequilibrium  between  three  loci  in  a  random  mating 
population  and  between  two  loci  in  a  partially  selfing  population.  The 
variance  of  the  two-locus,  squared  linkage  disequilibrium  in  a  random 
mating  population  is  examined.  The  effects  of  intragenic  recombination 
among  three  sites  within  a  gene  are  determined.  The  effect  of 
intragenic  recombination  within  a  hybrid  population  and  the  effect  on 
the  variance  of  homozygosity  are  examined.  The  variance  of 
homozygosity  within  a  structured  population  is  derived. 
Chapter 1


Introduction


Theoretical  population  genetics  is  an  attempt  to  describe 
mathematically  the  genetic  structures  of  populations  and  to  determine 
the consequences of natural processes (such as Mendelian inheritance
and selection) on such structures. In the following chapters several
different  problems  relevant  to  theoretical  population  genetics  are 
examined.  Each  chapter  deals  with  a  different  problem. 

All  of  these  problems  are  connected  by  the  method  used  to  solve 
them.  This  method  has  a  long  history  and  requires  an  introduction.  In 
general,  the  approach  is  to  define  a  variable,  or  a  set  of  variables 
which  define  the  identity  relations  among  a  specific  set  of  genes  and 
to  determine  how  these  variables  change  each  generation.  The  resulting 
system  of  recursion  relationships  is  at  least  informative  and  can 
often  be  solved. 

While  studying  the  effects  of  inbreeding  on  guinea  pigs,  Wright 
(1921)  found  it  useful  to  define  a  coefficient,  f,  which  describes  the 
degree  of  inbreeding.  This  coefficient  was  defined  as  the  genetic 
correlation  between  uniting  gametes  for  a  model  with  two  equally 
frequent  alleles.  Wright  (1922)  then  extended  the  method  and  allowed 
the  correlation  f  to  be  defined  for  all  gene  frequencies.  He  called 
this  correlation  the  inbreeding  coefficient.  The  concept  of  a  variable 
to  define  the  effects  of  inbreeding  proved  to  be  very  useful.  Wright 
found that if the population was initially in Hardy-Weinberg
equilibrium then the expected frequency of the genotypes AA, Aa and aa
would be $p^2 + pqf$, $2pq(1-f)$ and $q^2 + pqf$. Thus, inbreeding increases
the  frequency  of  homozygotes  and  decreases  the  frequency  of 
heterozygotes.  Perhaps  based  on  this  simple  formula,  Wright  (1951) 
changed  the  definition,  notation  and  name  of  f.  He  defined  the 
"inbreeding  coefficient  (or  fixation  index)  F"  as  the  deviation  from 
Hardy-Weinberg  proportions. 

Another approach to define the degree of inbreeding was taken by
Bernstein (1930), Haldane and Moshinsky (1939), Cotterman (1940) and
Malecot (1948). These authors used only probability arguments. Malecot
(1948) defined the "coefficient de consanguinité $f_M$" as the
probability that the two genes of an individual, M, are identical by
descent. This has caused some confusion since f, as a probability, can
take values (0, 1) while f, as a correlation, can take values (-1, 1).
Furthermore,  when  the  correlation  is  greater  than  zero,  both  indices 
are  equivalent  for  many  models.  It  is  therefore  important  to  realize 
that  these  quantities  are  distinct  even  though  they  may  have  the  same 
values.  Unfortunately,  the  name  "inbreeding  coefficient"  (variously 
designated  f  or  F)  for  both  indices  is  now  firmly  entrenched  in  the 
literature (e.g., Cavalli-Sforza and Bodmer, 1971; Crow and Kimura,
1970; Jacquard, 1974; Lewontin, 1974; Li, 1978; Malecot, 1969;
Roughgarden, 1979; Spiess, 1977; Wright, 1969). In the following,
"inbreeding  coefficient"  will  refer  to  Malecot's  definition. 

Malecot (1948) extended the method to consider the probability
that the genes of two individuals are identical by descent. He defined
the "coefficient de parenté f" as the probability that a gene,
chosen at random from individual I, is identical by descent to a gene
chosen at random from individual L. This coefficient has also proved
to be very useful and is in common use.

A cursory examination of popular texts demonstrates that many
different names are used for this coefficient. Malecot (1969) has
translated it as the "coefficient of coancestry" and also as the
"coefficient of kinship" (from Crow and Kimura, 1970). Crow and Kimura
(1970) call it both the "coefficient of kinship" and the "coefficient
of consanguinity". Jacquard (1974), Ewens (1979) and Cavalli-Sforza and
Bodmer (1971) use the term "coefficient of kinship" while Kempthorne
(1957) uses "coefficient of parentage". Falconer (1960) and
Roughgarden (1979) prefer "coefficient of coancestry". And finally,
Spiess (1977) calls it the "coefficient of transmission". The
confusion is added to by Roughgarden's (1979) definition of the
"kinship coefficient" as the mean number of alleles identical by
descent between two individuals (hence varying from 0 to 2).

These two variables are not, however, the final word. The whole
concept  can  be  extended  quite  generally  to  consider  the  probabilities 
of  identity  among  the  genes  of  any  particular  sample  of  gametes.  Again 
this  has  been  done  independently  by  several  authors. 

Harris  (1964)  described  a  set  of  coefficients  to  study  inbreeding 
and  called  these  "probabilities  of  alikeness  by  descent".  Cockerham 
(1971)  has  also  defined  a  set  and  simply  termed  the  coefficients  as 
probabilities,  or  as  two,  three,  etc.  "-gene  probability  functions". 
Gillois (1964), Jacquard (1974), Chevalet and Gillois (1977, 1978) and
Chevalet et al. (1977) have defined a set of coefficients to describe
the genetic structure of a population. They use the name "coefficients
of identity by descent". The inbreeding coefficient and the
coefficient  of  kinship  (or  whatever  is  preferred)  are  relegated  to 
special  identity  coefficients.  Some  of  these  coefficients  are 
identical  to  Cotterman's  "k  coefficients".  Serant  (1974,  1976)  defines 
a  set  of  coefficients  to  deal  with  two-locus  problems  (see  also 
Haldane,  1950  for  the  effects  of  inbreeding  with  linked  loci).  He 
calls  all  of  these  "inbreeding  coefficients"  and  uses  a  "kinship 
process"  to  derive  the  recursion  relationships.  Finally,  Cockerham  and 
Weir  (1968)  define  another  set  of  coefficients  which  they  call 
"descent  measures"  (Cockerham  and  Weir,  1973). 

Herein, we have followed Jacquard, Gillois and Chevalet and called
the variables we are using "identity coefficients". Again following
these authors, we have designated these variables with Greek letters.
We do not, however, impose the restriction that the relationship among
gametes must be "by descent". We consider only their identity by
"state", that is, whether or not the gametes carry the same alleles. This is a
simpler  assumption  and  allows  different  (and  perhaps  easier) 
formulations  to  be  made  for  some  models. 

In  the  following  chapters  several  different  problems  are  examined 
to  demonstrate  the  power  of  this  method.  Hence  several  different  sets 
of  coefficients  are  required  and  are  defined  in  each  chapter.  The 
definition  of  each  coefficient  is  valid  only  for  that  chapter. 
Although  a  general  definition  of  all  coefficients  may  be  possible, 
this  definition  would,  of  necessity,  be  complicated.  As  stated  above, 
coefficients can be defined for any desired probability. Therefore, it
seemed appropriate to define each coefficient in the context of the
problem  and  to  use  a  notation  which  imparts  as  much  information  as 
possible  in  a  simple  manner.  Beyond  this,  attempts  were  made  to  define 
the  coefficients  in  a  consistent  way.  For  other  symbols,  the  notation 
which  is  standard  in  the  literature  is  used. 

For  the  most  part,  the  recursion  relationships  are  placed  in 
appendices.  This  helps  to  make  the  biological  relevance  of  the  results 
clear.  Most  of  these  recursion  relationships  were  solved  numerically 
using  the  IMSL  subroutine  LEQT1F  on  an  Amdahl  470V/8  in  double 
precision.  This  subroutine  solves  the  equations  using  Gaussian 
elimination  (Crout  algorithm)  with  partial  pivoting  and  row  scaling. 
Since  the  equations  solved  are  approximations,  the  answers  obtained 
are  accurate  to  the  degree  that  these  approximations  are  correct.  To 
be  precise,  the  relative  size  of  the  perturbation  in  the  answers  is  of 
the  same  order  of  magnitude  as  the  relative  size  of  the  perturbation 
in  the  system  of  equations  (Noble  and  Daniel,  1977,  Theorem  5.9).  The 
graphs were produced using the University of Alberta's CGPL and CPLT3D
subroutines.
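The numerical step described above can be illustrated with a minimal sketch in Python. It is not the IMSL routine LEQT1F; it only shows Gaussian elimination with partial pivoting and row scaling applied to the equilibrium of a linear recursion x' = Mx + c, which is the solution of (I - M)x = c. The function names `solve_linear` and `equilibrium` are assumptions introduced for this illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with scaled partial pivoting."""
    n = len(b)
    # Work on an augmented copy so that the inputs are left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    # Row scale factors: the largest absolute coefficient in each row.
    scale = [max(abs(v) for v in row[:n]) for row in m]
    if any(s == 0.0 for s in scale):
        raise ValueError("matrix is singular")
    for col in range(n):
        # Pivot on the row whose scaled entry in this column is largest.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]) / scale[r])
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        scale[col], scale[pivot] = scale[pivot], scale[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def equilibrium(mat, const):
    """Fixed point of x' = mat x + const, from (I - mat) x = const."""
    n = len(const)
    lhs = [[(1.0 if i == j else 0.0) - mat[i][j] for j in range(n)]
           for i in range(n)]
    return solve_linear(lhs, const)
```

Under these assumptions, `equilibrium(M, c)` would be called with the coefficient matrix and constant vector of a system of recursion relationships, such as those placed in the appendices.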

All  of  the  problems  approached  involve  a  finite  population  size 
and  fall  into  two  broad  categories.  The  first  category  deals  with 
linkage  disequilibrium  (a  measure  of  the  correlation  between  alleles 
at  different  loci)  in  a  finite  population  and  in  the  absence  of 
selection.  The  second  category  deals  with  some  effects  of  intragenic 
recombination. In Chapter 2 it is shown that the effect of
recombination within a gene on the expected homozygosity increases
when recombination can occur between a larger number of sites. The
same model shows that the squared linkage disequilibrium between three
loci can be relatively large. In Chapter 3 the expected squared
linkage  disequilibrium,  when  partial  selfing  occurs,  is  determined.  It 
is  shown  that  the  linkage  disequilibrium  can  be  much  larger  than  that 
expected  for  a  randomly  mating  population.  The  theory  developed  for 
randomly  mating  populations  can  be  applied  to  a  population  with 
partial  selfing  using  a  simple  transformation  which  defines 
"effective"  values  for  the  rate  of  recombination  and  the  population 
size.  Chapter  4  examines  the  ability  of  recombination  within  genes  to 
create  unique  alleles  in  hybrids.  It  is  also  shown  that  when  such 
recombination  occurs,  a  randomly  mating  population  may  maintain  a 
greater  number  of  alleles  than  does  a  structured  population.  Chapter  5 
describes  a  method  to  find  the  variance  of  homozygosity  in  a 
structured  population.  It  is  shown  that  the  variance  is  quite 
sensitive  to  the  amount  of  migration.  However,  the  variance  can  be 
accurately  estimated  from  the  amount  of  homozygosity.  Chapter  6 
examines  the  variance  of  homozygosity  for  a  gene  which  consists  of  two 
sites, and the variance of squared linkage disequilibrium. It is shown
that the coefficient of variation of the linkage disequilibrium is
often greater than 100%.


Chapter  2 


The  Expected  Variance  of  Linkage  Disequilibrium  Between 
Three  Loci  in  a  Finite  Population 


Introduction


Since  linkage  disequilibrium  can  be  generated  by  epistatic 
selection  between  loci,  this  quantity  has  been  extensively  used  in  the 
search  for  the  effects  of  selection  in  natural  populations.  Linkage 
disequilibrium  can  also  be  generated  by  other  factors  such  as  recent 
admixture,  assortative  mating  and  random  drift  in  a  finite  population. 
Therefore,  in  order  to  determine  whether  the  observed  values  could  be 
caused  by  random  drift,  it  is  necessary  to  determine  the  expected 
value  of  the  variance  of  linkage  disequilibrium  in  a  finite  population 
without selection. This was done by Hill and Robertson (1968) and Ohta
and Kimura (1969) for the two-locus model with two alleles per locus
and no mutation, by Weir and Cockerham (1974) for the two-locus model
without mutation and by Hill (1975) for the infinite alleles,
two-locus model. Experimental studies of linkage disequilibrium, while
determining such two-locus linkage disequilibria, often determine the
strength of three-locus linkage disequilibria as well (e.g., Allard et
al., 1972; Langley et al., 1974; Mukai et al., 1974; Brown et al.,
1977). The study of the expected value of the variance of three-locus
linkage disequilibrium in a finite population has been less well
developed. Hill (1974a, 1974b) has studied the transient behavior of
three-locus linkage disequilibrium with two alleles at each locus and
has found that, although three-locus linkage disequilibrium decays
faster than two-locus linkage disequilibrium, both can reach
appreciable levels before declining to zero.

Here,  the  expected  variance  of  three-locus  linkage  disequilibrium 
is  determined  in  the  absence  of  selection.  This  is  done  for  the 
infinite alleles model (Kimura and Crow, 1964), using identity
coefficients  and  then  solving  the  equations  numerically  on  a  computer. 
It  is  shown  that  the  variance  of  three-locus  linkage  disequilibrium  is 
of  the  same  order  of  magnitude  as  the  variance  of  two-locus  linkage 
disequilibrium.  Hence,  even  if  third  order  linkage  disequilibrium  is 
observed  at  appreciable  values  between  closely  linked  loci  this  is  not 
necessarily  an  indication  of  selection.  This  model  can  also  be 
interpreted  as  intragenic  recombination  between  three  sites  to  show 
that  a  gene  consisting  of  three  sites  can  have  many  more  alleles 
present  than  genes  with  either  one  or  two  sites. 


Theory 


The three loci are designated as A, B and C with mutation rates
$v_1$, $v_2$ and $v_3$, respectively. Let $r_{12}$ be the probability of
recombination between A and B, $r_{23}$ between B and C and $r_{13}$ between A
and C. It is useful to define $x_1$ as the probability of recombination
between B and C but not between A and B, $x_2$ as the probability of
recombination between A and B but not between B and C and $x_3$ as the
probability of recombination between A and B and between B and C.
Hence

$$x_1 = \tfrac{1}{2}(r_{13} + r_{23} - r_{12})$$
$$x_2 = \tfrac{1}{2}(r_{12} + r_{13} - r_{23})$$
$$x_3 = \tfrac{1}{2}(r_{12} + r_{23} - r_{13})$$

(Strobeck, 1976).
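The three recombination classes can be computed directly from the pairwise recombination probabilities. The following is a small sketch in Python; the function name `recombination_classes` and the numerical values in the example are assumptions chosen only for illustration.

```python
def recombination_classes(r12, r23, r13):
    """Return (x1, x2, x3) for three loci in the order A, B, C.

    x1: recombination between B and C but not between A and B
    x2: recombination between A and B but not between B and C
    x3: recombination in both intervals
    """
    x1 = 0.5 * (r13 + r23 - r12)
    x2 = 0.5 * (r12 + r13 - r23)
    x3 = 0.5 * (r12 + r23 - r13)
    return x1, x2, x3


# Hypothetical example with r13 = r12 + r23, so double recombinants vanish.
x1, x2, x3 = recombination_classes(0.01, 0.02, 0.03)
```

The pairwise values are recovered as $r_{12} = x_2 + x_3$, $r_{23} = x_1 + x_3$ and $r_{13} = x_1 + x_2$, which gives a quick check on any set of inputs.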

The three-locus model used here assumes a finite population size
of 2N gametes and the infinite alleles model of Kimura and Crow
(1964). Following the Wright-Fisher model, the chromosomes of the
present generation are derived by a random sampling, with replacement,
from the chromosomes of the past generation. Each gamete chosen may or
may not be a recombinant with the probabilities given above. For
example, for any two arbitrary gametes $a_1b_1c_1$ and $a_2b_2c_2$, the meiotic
product $a_1b_1c_2$ will be selected with probability $\tfrac{1}{2}x_1$. The sampling
process is continued until 2N new gametes have been generated. A
similar method was used (Strobeck and Morgan, 1978) to analyze a two
site model.
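The sampling scheme just described can be sketched as a simulation step in Python. This is an illustrative sketch only: it assumes that each new gamete is formed from two parental gametes drawn with replacement, it omits mutation, and the function name `next_generation` is hypothetical. The thesis itself works with recursion relationships rather than simulation.

```python
import random


def next_generation(gametes, x1, x2, x3, rng):
    """Draw a new generation of three-locus gametes (a, b, c).

    x1, x2 and x3 are the probabilities of recombination between B and C
    only, between A and B only, and in both intervals, respectively.
    """
    new = []
    for _ in range(len(gametes)):
        # g1 is equally likely to be either member of the parental pair,
        # which supplies the factor 1/2 for each specific meiotic product.
        g1 = rng.choice(gametes)
        g2 = rng.choice(gametes)
        u = rng.random()
        if u < x1:                      # break between B and C only
            new.append((g1[0], g1[1], g2[2]))
        elif u < x1 + x2:               # break between A and B only
            new.append((g1[0], g2[1], g2[2]))
        elif u < x1 + x2 + x3:          # breaks in both intervals
            new.append((g1[0], g2[1], g1[2]))
        else:                           # non-recombinant gamete
            new.append(g1)
    return new
```

A call such as `next_generation(population, x1, x2, x3, random.Random(1))` returns a list of the same length as `population`, so that repeated calls follow a population of 2N gametes through successive generations.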

To describe the behavior of the system from one generation to the
next requires twenty-eight different identity coefficients. Three of
these variables define the probability of identity at each of the
three loci, A, B and C. As has been shown for the two-locus model,
another three variables are required when genes at two loci are
considered jointly. Since there are three pairs of loci for the
three-locus model (AB, AC and BC), this adds another nine variables. A
further sixteen are necessary when the three loci are considered
jointly. These coefficients involve choosing two to six distinct
gametes at random without replacement. For clarity those which involve
two distinct gametes are denoted by $\Phi$, three gametes by $\Gamma$, four
gametes by $\Delta$, five gametes by $\Lambda$ and those involving six distinct
gametes by $\Psi$. Letters in the subscripts indicate which loci are being
considered and a slash is used to separate those loci which come from
different chromosomes. The coefficients are defined in Table 2.1. The
sixteen coefficients used by Hill (1974b, Table 3), each a product of
gamete frequencies, can be derived from these coefficients by a linear
transformation.


If $\Phi_{A/A}$ is the probability that two genes are identical, then
$\Phi_{A/A}$ has the recursion relationship

$$\Phi'_{A/A} = (1 - v_1)^2 \left[ \frac{1}{2N} + \left(1 - \frac{1}{2N}\right) \Phi_{A/A} \right]$$

where $v_1$ is the mutation rate to neutral, distinct alleles at that
locus (Kimura and Crow, 1964). The recursion relationships for two
linked loci ($\Phi_{AB/AB}$, $\Gamma_{AB/A/B}$ and $\Delta_{A/A/B/B}$) are derived by
Strobeck and Morgan (1978). Those for $\Phi_{C/C}$, $\Phi_{BC/BC}$, $\Phi_{AC/AC}$,
$\Gamma_{BC/B/C}$, $\Gamma_{AC/A/C}$, $\Delta_{B/B/C/C}$ and $\Delta_{A/A/C/C}$ are the same except that
mutation rates and recombination rates have to be changed
appropriately. These recursion relationships are included in Appendix
1.
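The single-locus recursion can be iterated numerically to see how it settles to its equilibrium. The sketch below, in Python, uses hypothetical parameter values (N = 1000 and a mutation rate chosen so that 4Nv = 1) and compares the result with the well-known large-N approximation 1/(1 + 4Nv) of Kimura and Crow (1964). The function names are assumptions made for this illustration.

```python
def next_identity(phi, n, v):
    """One generation of the single-locus recursion for 2N gametes."""
    return (1.0 - v) ** 2 * (1.0 / (2 * n) + (1.0 - 1.0 / (2 * n)) * phi)


def equilibrium_identity(n, v):
    """Exact fixed point of the single-locus recursion."""
    a = (1.0 - v) ** 2
    return (a / (2 * n)) / (1.0 - a * (1.0 - 1.0 / (2 * n)))


n, v = 1000, 2.5e-4          # hypothetical values giving 4Nv = 1
phi = 0.0                    # start with no identity between genes
for _ in range(50_000):
    phi = next_identity(phi, n, v)

approx = 1.0 / (1.0 + 4 * n * v)   # large-N approximation
```

With these values the iterated and exact equilibria agree closely, and both lie near the approximate value of one half.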


Table 2.1: Definitions of identity coefficients for three
linked loci.
Where "Prob($a_i \equiv a_j$)" is the probability that the allele at locus A
from the i-th gamete is identical to the allele at locus A from
the j-th gamete.


The recursion relationships for three loci are more complicated
than those for two loci and it would be very time consuming to write
them down. It is therefore useful to make some initial approximations.
It is assumed that $N \gg 1$, $v_i = O(N^{-1})$ and all of the recombination
parameters are of $O(N^{-1})$. Terms with $O(N^{-2})$ are neglected in writing
the recursion equations since these terms will affect the answers only
to $O(N^{-2})$ (Noble and Daniel, 1977, Theorem 5.9). The recursion
relationships for the expected values of the 28 coefficients over
replicate populations are given in Appendix 1. Additionally let
$v_1 = v_2 = v_3 = v$, $r_{12} = r_{23} = r$ and $r_{13} = r_{12} + r_{23}$. This is equivalent to
a model with complete interference. However, a model without
interference implies $r_{13} = r_{12} + r_{23} - 2r_{12}r_{23}$ but by assumption $r_{12}$,
$r_{23} \ll 1$ and thus $r_{13} \approx r_{12} + r_{23}$. The model therefore holds both with
or without interference. Making these substitutions in the equations
implies that

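$$\Phi_{A/A} = \Phi_{B/B} = \Phi_{C/C}$$
$$\Phi_{AB/AB} = \Phi_{BC/BC}$$
$$\Gamma_{AB/A/B} = \Gamma_{BC/B/C}$$
$$\Delta_{A/A/B/B} = \Delta_{B/B/C/C}$$
$$\Gamma_{ABC/AB/C} = \Gamma_{ABC/BC/A}$$
$$\Delta_{AB/AB/C/C} = \Delta_{BC/BC/A/A}$$
$$\Delta_{AB/AC/B/C} = \Delta_{BC/AC/A/B}$$
$$\Lambda_{AB/A/B/C/C} = \Lambda_{BC/A/A/B/C}$$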

which  reduces  the  number  of  necessary  coefficients  to  nineteen.  Even 
if  an  explicit  equilibrium  solution  is  obtained  for  each  of  the 
coefficients,  such  solutions  would  be  too  complicated  to  be  of  value 
and  therefore  the  equilibrium  solutions  for  various  parameter  values 
are  obtained  numerically  on  a  computer. 
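The claim that the model holds with or without interference can be checked numerically: the two expressions for the outer recombination probability differ only by a term of order $N^{-2}$ when the recombination parameters are of order $N^{-1}$. The Python sketch below uses a hypothetical population size, and the function names are assumptions made for this illustration.

```python
def r13_no_interference(r12, r23):
    """Outer recombination probability when crossovers are independent."""
    return r12 + r23 - 2.0 * r12 * r23


def r13_complete_interference(r12, r23):
    """Outer recombination probability when double crossovers never occur."""
    return r12 + r23


n = 10_000                   # hypothetical population size
r = 1.0 / n                  # recombination probability of order 1/N
gap = r13_complete_interference(r, r) - r13_no_interference(r, r)
# gap equals 2 r^2, a term of order N**-2 that the recursions neglect
```

Because the difference is of the same order as the terms already dropped from the recursion equations, either assumption leads to the same approximate system.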

Cockerham and Weir (1973) and Serant (1974, 1976) have shown that
there is a simple relationship between these identity coefficients and
gene frequency moments, as is shown in Table 2.2. This table gives the
gene frequency moment to which each of the nineteen coefficients
corresponds in expectation. Here, $f_{ijk}$ is the frequency of the gamete
carrying the i-th allele at locus A, the j-th allele at locus B and
the k-th allele at locus C; $f_{ij\cdot}$ is the frequency of the gamete
carrying the i-th allele at locus A and the j-th allele at locus B;
$p_i$, $q_j$ and $r_k$ are the frequencies of alleles i, j and k at loci A, B
and C respectively.

Table 2.2: Relations between the expected values of the
identity coefficients over replicate populations and the
gene frequency moments.

The expected linkage disequilibrium, $E(D_{ij})$,
between alleles $a_i$ and $b_j$ can be expressed as

$$E(D_{ij}) = E(f_{ij\cdot} - p_i q_j)$$

The natural extension of this quantity for three loci is

$$E(D_{ijk}) = E(f_{ijk} - f_{\cdot jk}\,p_i - f_{i\cdot k}\,q_j - f_{ij\cdot}\,r_k + 2 p_i q_j r_k)$$

(Hill, 1976). For the model used here both $E(D_{ij})$ and $E(D_{ijk})$ are zero
for all i, j and k; however, they have non-trivial variances. Hill
(1975) determined that at equilibrium

$$E\Big(\sum_i \sum_j D_{ij}^2\Big) = \frac{16N^2v^2\,(8Nv + 2Nr + 5)}{(1 + 4Nv)\,(256N^3v^3 + 192N^3v^2r + 32N^3vr^2 + 320N^2v^2 + 152N^2vr + 8N^2r^2 + 108Nv + 26Nr + 9)}$$
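The two-locus equilibrium formula attributed above to Hill (1975) is easy to evaluate numerically. The following Python sketch takes the scaled parameters Nv and Nr as arguments; the function name `hill_two_locus` is hypothetical, and the expression is transcribed from the formula as read in this text rather than from Hill's paper.

```python
def hill_two_locus(nv, nr):
    """Expected sum of squared two-locus disequilibria at equilibrium.

    nv is N times the mutation rate and nr is N times the recombination
    probability between the two loci.
    """
    num = 16 * nv**2 * (8 * nv + 2 * nr + 5)
    poly = (256 * nv**3 + 192 * nv**2 * nr + 32 * nv * nr**2
            + 320 * nv**2 + 152 * nv * nr + 8 * nr**2
            + 108 * nv + 26 * nr + 9)
    return num / ((1 + 4 * nv) * poly)


# Hypothetical example: 4Nv = 2.0 with no recombination, then with Nr = 10.
no_recombination = hill_two_locus(0.5, 0.0)
loose_linkage = hill_two_locus(0.5, 10.0)
```

Evaluating the function over a range of Nr for fixed 4Nv shows the expected squared disequilibrium falling as recombination increases.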


With three loci $E(\sum_i\sum_j\sum_k D_{ijk}^2)$ can be approximated as

$$\begin{aligned}
&\Phi_{ABC/ABC} - 2\Gamma_{ABC/BC/A} - 2\Gamma_{ABC/AC/B} - 2\Gamma_{ABC/AB/C} \\
&+ 4\Delta_{ABC/A/B/C} + \Delta_{BC/BC/A/A} + 2\Delta_{AC/BC/A/B} + 2\Delta_{AB/BC/A/C} \\
&- 4\Lambda_{BC/A/A/B/C} + \Delta_{AC/AC/B/B} + 2\Delta_{AB/AC/B/C} - 4\Lambda_{AC/A/B/B/C} \\
&+ \Delta_{AB/AB/C/C} - 4\Lambda_{AB/A/B/C/C} + 4\Psi_{A/A/B/B/C/C}
\end{aligned}$$

as long as N is large. Using the values obtained for the identity
coefficients, $E(\sum_i\sum_j\sum_k D_{ijk}^2)$ can be determined for any values of Nr and
Nv.


Figure 2.1 gives the value of $E(\sum_i\sum_j D_{ij}^2)$ (2.1a) and $E(\sum_i\sum_j\sum_k D_{ijk}^2)$
(2.1b) for 4Nv = 2.0, 1.0 and 0.5 with $10^{-2} < Nr < 10^{2}$. It shows that
while the variance of three-locus linkage disequilibrium is generally
smaller than two-locus they are of the same order of magnitude. Both
second and third order linkage disequilibria change values slowly when
Nr is less than 0.01 and are negligible when Nr is greater than 10.0.


If r = 0 then

$$E\Big(\sum_i\sum_j\sum_k D_{ijk}^2\Big) = \frac{\theta^3\,(12\theta^5 + 130\theta^4 + 534\theta^3 + 1074\theta^2 + 1086\theta + 460)}{(1+\theta)^2\,(2+\theta)(3+\theta)(5+\theta)(1+2\theta)(3+2\theta)(1+3\theta)(10+3\theta)}$$

(where $\theta = 4Nv$), which is thus a good approximation whenever
Nr < 0.01.
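The closed form for r = 0 can be evaluated with a short Python function. This is a hedged sketch: the polynomial coefficients are as read from the text, but the placement of the single squared factor in the denominator is an assumption (it is taken here to be the factor 1 + θ, which makes the expression tend to zero for large θ). The function name `three_locus_r0` is hypothetical.

```python
def three_locus_r0(theta):
    """Expected sum of squared three-locus disequilibria when r = 0.

    theta stands for 4Nv. The squared factor (1 + theta) is an assumed
    reading of the printed formula.
    """
    t = theta
    num = t**3 * (12 * t**5 + 130 * t**4 + 534 * t**3
                  + 1074 * t**2 + 1086 * t + 460)
    den = ((1 + t) ** 2 * (2 + t) * (3 + t) * (5 + t)
           * (1 + 2 * t) * (3 + 2 * t) * (1 + 3 * t) * (10 + 3 * t))
    return num / den


# Values at the three mutation parameters used for Figure 2.1.
values = {theta: three_locus_r0(theta) for theta in (0.5, 1.0, 2.0)}
```

Under this reading the three-locus values at r = 0 are smaller than the corresponding two-locus values but of the same order of magnitude, in line with the description of Figure 2.1.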

This model can also be interpreted as three sites within a single
gene rather than three separate loci. This interpretation is
appropriate for genes which contain introns, a feature of eukaryotic
genes (Gilbert, 1978; Crick, 1979). For this model the coefficient
$\Phi_{ABC/ABC}$ is approximately the expected homozygosity of a single gene
with three sites. Figure 2.2 gives the effective number of alleles
(one over the homozygosity) for $0 < N\mu < 2.0$, and
$r = 0$, $\mu$, $2\mu$, $5\mu$, $10\mu$ and $r \gg \mu$, where $\mu$ refers to the total mutation
rate of the gene ($\mu = 3v$ for the three site model) and similarly
$r = r_{13} = r_{12} + r_{23}$ is the recombination rate within the whole gene. The
effective number of alleles is larger when the gene consists of three
sites than when the gene consists of two sites. However, a comparison
of Figure 2.2a taken from Strobeck and Morgan (1978, Figure 1) with
Figure 2.2b shows that the two site model is a good approximation of
the three site model if $r < 2\mu$.


Figure 2.1: The expected squared linkage disequilibrium between
two loci, (a) $E(\sum_i\sum_j D_{ij}^2)$, and between three loci, (b) $E(\sum_i\sum_j\sum_k D_{ijk}^2)$.
Horizontal axis: Nr; curves labelled by 4Nv.


Summary 


The  variance  of  three-locus  linkage  disequilibria  for  an 
equilibrium  infinite  alleles  model  is  solved  numerically  on  a 
computer,  using  identity  coefficients.  It  is  shown  that  the  variance 
of  three-locus  linkage  disequilibrium  created  by  random  drift, 
although  smaller  than  the  variance  of  two-locus  linkage 
disequilibrium,  is  of  the  same  order  of  magnitude.  Hence  third  order 
disequilibria  are  not  necessarily  good  indications  of  selection.  The 
formula  for  the  variance  of  linkage  disequilibrium  is  given  when  there 
is no recombination between the genes. This model can also be
interpreted as intragenic recombination between three sites within a
gene.


Figure 2.2: The effective number of alleles for a gene consisting
of two sites (a), and of three sites (b).
Chapter 3


Linkage Disequilibrium in a Finite Population that is Partially
Selfing


Introduction


There  have  been  several  studies  on  the  amount  of  linkage 
disequilibrium  found  in  natural  populations.  Most  of  these  studies 
found  no  significant  linkage  disequilibrium  between  loci  that  are  not 
associated with an inversion (Lewontin, 1974; Langley, Ito and
Voelker, 1977). However, in plant populations that are partially
selfing,  a  significant  amount  of  linkage  disequilibrium  is 
consistently  present  (Brown,  1979).  This  observed  linkage 
disequilibrium  could  be  generated  either  by  selection  with  epistatic 
interactions  between  the  loci  or  by  random  drift.  In  order  to 
determine  whether  or  not  this  observed  disequilibrium  could  be  a 
result  of  random  drift,  it  is  necessary  to  know  the  amount  of  linkage 
disequilibrium  expected  in  a  partially  selfing  finite  population 
without  selection. 

The  expected  amount  of  linkage  disequilibrium  in  a  finite 
population  with  random  mating  has  been  studied  extensively.  These 
studies have assumed two alleles at each locus with no mutation (Hill
and Robertson, 1968; Ohta and Kimura, 1969), a two-locus model with no
mutation (Weir and Cockerham, 1974) or an infinite number of alleles
at each locus with mutant alleles differing from all pre-existing ones
(Hill, 1975), i.e., the infinite-allele model of Kimura and Crow
(1964). In this chapter, the amount of linkage disequilibrium expected
in  a  finite  population  assuming  the  infinite  allele  model  and  partial 
selfing  is  derived  using  identity  coefficients.  It  is  shown  that  the 
formulas  for  the  expected  sum  of  squares  of  the  linkage  disequilibria 
and  the  squared  standard  linkage  disequilibrium  are  equivalent  to 
those  from  random  mating  with  a  reduced  recombination  value  and  a 
reduced  population  size. 


Theory 


Before considering random drift of two loci in a finite
population that is primarily selfing, the one-locus model is
developed.

Let  the  population  consist  of  N  diploid  individuals  that  produce 
offspring  by  both  selfing  and  outcrossing.  Let  S  be  the  proportion  of 
the offspring of an individual that are produced by selfing and $1-S$
the proportion of offspring produced by outcrossing. Each of the N
individuals  in  the  next  generation  is  the  offspring  of  either  one 
individual  selected  at  random  (if  it  is  produced  by  selfing)  or  two 
individuals  selected  at  random  without  replacement  (if  it  is  produced 
by  outcrossing)  from  the  present  generation.  If  S  =  1/N,  then  there  is 
random  union  of  gametes. 


Two identity coefficients are needed to describe the behavior of
the system from one generation to the next. One coefficient, $\Psi_{(A/A)}$,
is the probability that the two genes of an individual are identical
(Malecot's inbreeding coefficient). The other coefficient, $\Phi_{(A)(A)}$, is
the probability that two genes selected from two different individuals
are identical (Malecot's kinship coefficient). (The notation used for
the subscripts is explained when considering the two-locus model.)
Since the probability of an offspring having its two genes identical
is $\tfrac{1}{2} + \tfrac{1}{2}\Psi_{(A/A)}$ if it is produced by selfing and $\Phi_{(A)(A)}$ if it is
produced by outcrossing,

$$\Psi'_{(A/A)} = (1-\mu)^2 \left\{ S\left(\tfrac{1}{2} + \tfrac{1}{2}\Psi_{(A/A)}\right) + (1-S)\,\Phi_{(A)(A)} \right\} \tag{1a}$$

$$\Phi'_{(A)(A)} = (1-\mu)^2 \left\{ \frac{1}{N}\left(\tfrac{1}{2} + \tfrac{1}{2}\Psi_{(A/A)}\right) + \left(1 - \frac{1}{N}\right)\Phi_{(A)(A)} \right\} \tag{1b}$$

where $\mu$ is the mutation rate to unique alleles. These relations define
the expected value of the coefficients in the next generation in terms
of their previous values.


If $N \gg 1$ and $\mu = O(1/N)$, then these equations can be approximated
by

$$\Psi'_{(A/A)} = S\left(\tfrac{1}{2} + \tfrac{1}{2}\Psi_{(A/A)}\right) + (1-S)\,\Phi_{(A)(A)} \tag{2a}$$

if terms of $O(1/N)$ or less are neglected, and

$$\Phi'_{(A)(A)} = \frac{1}{N}\left(\tfrac{1}{2} + \tfrac{1}{2}\Psi_{(A/A)}\right) + \left(1 - \frac{1}{N} - 2\mu\right)\Phi_{(A)(A)} \tag{2b}$$

if terms of $O(1/N^2)$ or less are neglected. At equilibrium

$$\Psi_{(A/A)} = \frac{\tfrac{1}{2}S + (1-S)\,\Phi_{(A)(A)}}{1 - \tfrac{1}{2}S}$$

from (2a) and substituting this value into (2b)

$$\Phi_{(A)(A)} = \frac{1}{1 + 4N\mu - 2N\mu S} = \frac{1}{1 + 4N_e\mu} \tag{3a}$$

and therefore

$$\Psi_{(A/A)} = \frac{1 + 2N\mu S}{1 + 4N\mu - 2N\mu S} = \frac{1 + 2N\mu S}{1 + 4N_e\mu} \tag{3b}$$

where $N_e = (1 - \tfrac{1}{2}S)N$. It can be verified that these are the approximate
equilibrium values of equations (1a) and (1b) by substitution or from
the theory of perturbed matrices (section 5.5, Noble and Daniel,
1977). If $\Phi_A$ is the probability that two genes chosen randomly from
the population without replacement (not necessarily from two different
individuals) are identical, then

$$\Phi_A = \frac{1}{2N-1}\,\Psi_{(A/A)} + \left(1 - \frac{1}{2N-1}\right)\Phi_{(A)(A)} \approx \Phi_{(A)(A)}$$

since $N \gg 1$ (Cockerham, 1967).
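The agreement between the exact recursions (1a) and (1b) and the approximate equilibria (3a) and (3b) can be checked by direct iteration. The Python sketch below uses hypothetical parameter values (N = 1000, S = 0.5 and a mutation rate of 0.0001); the function names `step` and `approx_equilibrium` are assumptions made for this illustration.

```python
def step(psi, phi, n, s, mu):
    """One generation of equations (1a) and (1b)."""
    a = (1.0 - mu) ** 2
    same_parent = 0.5 + 0.5 * psi      # two genes drawn from one parent
    new_psi = a * (s * same_parent + (1.0 - s) * phi)
    new_phi = a * (same_parent / n + (1.0 - 1.0 / n) * phi)
    return new_psi, new_phi


def approx_equilibrium(n, s, mu):
    """Approximate equilibria (3a) and (3b), with N_e = (1 - S/2) N."""
    ne = (1.0 - 0.5 * s) * n
    phi = 1.0 / (1.0 + 4.0 * ne * mu)
    psi = (1.0 + 2.0 * n * mu * s) / (1.0 + 4.0 * ne * mu)
    return psi, phi


n, s, mu = 1000, 0.5, 1e-4   # hypothetical parameter values
psi, phi = 0.0, 0.0          # start with no identity
for _ in range(100_000):
    psi, phi = step(psi, phi, n, s, mu)

psi_approx, phi_approx = approx_equilibrium(n, s, mu)
```

For these values the iterated and approximate equilibria differ by a relative amount of the order of 1/N, and the probability of identity within an individual exceeds that between individuals, as expected under partial selfing.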

We now turn our attention to the two-locus model. Denote the two
loci by A and B, and let r be the recombination value between them.
Let N be the number of diploid individuals, S be the proportion of
selfing and $\mu$ and $\nu$ be the mutation rates to unique alleles at the A
and B loci, respectively.

Sixteen identity coefficients are required to describe random
drift of two loci in a finite population that is partially selfing.
These identity coefficients involve randomly choosing chromosomes
without replacement from one, two, three or four different individuals
and are denoted by $\Psi$, $\Phi$, $\Gamma$ and $\Delta$ respectively. The following notation
is used in the subscripts: parentheses are used to separate the genes
contributed by different individuals, and slashes are used to separate
the genes contributed by different chromosomes of an individual. For
example, $\Phi_{(AB)(A/B)}$ is the probability of identity at both loci if the
genes at the A and B loci are chosen from one chromosome of one
individual and from different chromosomes of another individual. If the
genes on the two chromosomes of an arbitrary individual i are denoted by
$a_{i1}b_{i1}$ and $a_{i2}b_{i2}$, respectively, then the sixteen identity
coefficients are given in Table 3.1.


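The recursion relationships for the expected values of the sixteen identity coefficients over replicate populations are given in Appendix 2. At equilibrium,

$$
\begin{aligned}
\Psi_{(A/A)} &= \frac{\tfrac{1}{2}S + (1-S)\Phi_{(A)(A)}}{1 - \tfrac{1}{2}S} \\
\Psi_{(B/B)} &= \frac{\tfrac{1}{2}S + (1-S)\Phi_{(B)(B)}}{1 - \tfrac{1}{2}S} \\
\Psi_{(AB/AB)} &= \frac{\tfrac{1}{2}S + (1-S)\Phi_{(AB)(AB)}}{1 - \tfrac{1}{2}S} \\
\Phi_{(AB)(A/B)} &= \frac{\tfrac{1}{2}S\,\Phi_{(AB)(AB)} + (1-S)\Gamma_{(AB)(A)(B)}}{1 - \tfrac{1}{2}S} \\
\Phi_{(AB/B)(A)} &= \frac{\tfrac{1}{2}S\,\Phi_{(A)(A)} + (1-S)\Gamma_{(AB)(A)(B)}}{1 - \tfrac{1}{2}S} \\
\Phi_{(AB/A)(B)} &= \frac{\tfrac{1}{2}S\,\Phi_{(B)(B)} + (1-S)\Gamma_{(AB)(A)(B)}}{1 - \tfrac{1}{2}S} \\
\Gamma_{(B/B)(A)(A)} &= \frac{\tfrac{1}{2}S\,\Phi_{(A)(A)} + (1-S)\Delta_{(A)(B)(A)(B)}}{1 - \tfrac{1}{2}S} \\
\Gamma_{(A/A)(B)(B)} &= \frac{\tfrac{1}{2}S\,\Phi_{(B)(B)} + (1-S)\Delta_{(A)(B)(A)(B)}}{1 - \tfrac{1}{2}S}
\end{aligned}
\qquad (4)
$$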


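Table 3.1: Definitions of identity coefficients for a partial selfing population.

$\Psi_{(A/A)} = P(a_{i1} \equiv a_{i2})$
$\Phi_{(A)(A)} = P(a_{i1} \equiv a_{j1})$
$\Psi_{(B/B)} = P(b_{i1} \equiv b_{i2})$
$\Phi_{(B)(B)} = P(b_{i1} \equiv b_{j1})$
$\Psi_{(AB/AB)} = P(a_{i1} \equiv a_{i2} \text{ and } b_{i1} \equiv b_{i2})$
$\Phi_{(AB)(AB)} = P(a_{i1} \equiv a_{j1} \text{ and } b_{i1} \equiv b_{j1})$
$\Phi_{(AB)(A/B)} = P(a_{i1} \equiv a_{j1} \text{ and } b_{i1} \equiv b_{j2})$
$\Phi_{(AB/B)(A)} = P(a_{i1} \equiv a_{j1} \text{ and } b_{i1} \equiv b_{i2})$
$\Phi_{(AB/A)(B)} = P(a_{i1} \equiv a_{i2} \text{ and } b_{i1} \equiv b_{j1})$
$\Phi_{(A/A)(B/B)} = P(a_{i1} \equiv a_{i2} \text{ and } b_{j1} \equiv b_{j2})$
$\Phi_{(A/B)(A/B)} = P(a_{i1} \equiv a_{j1} \text{ and } b_{i2} \equiv b_{j2})$
$\Gamma_{(AB)(A)(B)} = P(a_{i1} \equiv a_{j1} \text{ and } b_{i1} \equiv b_{k1})$
$\Gamma_{(B/B)(A)(A)} = P(a_{j1} \equiv a_{k1} \text{ and } b_{i1} \equiv b_{i2})$
$\Gamma_{(A/A)(B)(B)} = P(a_{i1} \equiv a_{i2} \text{ and } b_{j1} \equiv b_{k1})$
$\Gamma_{(A/B)(A)(B)} = P(a_{i1} \equiv a_{j1} \text{ and } b_{i2} \equiv b_{k1})$
$\Delta_{(A)(B)(A)(B)} = P(a_{i1} \equiv a_{k1} \text{ and } b_{j1} \equiv b_{l1})$

The genes of the two chromosomes of an individual $i$ are denoted by $a_{i1}b_{i1}$ and $a_{i2}b_{i2}$, respectively. ("$\equiv$" is read "is identical to").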
$$\Gamma_{(A/B)(A)(B)} = \frac{\tfrac{1}{2}S\,\Gamma_{(AB)(A)(B)} + (1-S)\Delta_{(A)(B)(A)(B)}}{1 - \tfrac{1}{2}S}$$

and $\Phi_{(A/A)(B/B)}$ and $\Phi_{(A/B)(A/B)}$ are given, respectively, by


■,sVw  ,  js;g-s)(i(A1(A14(B)m)  .  (1-S)2(1. 

(1  ♦  *sS)(i  -  ^S)2 


^^^(AB)  (AB)  *  2s  (1~S:)r(ABHAHB)  (1~S)  (1  +  2Sj  A  (A)  W  (A)  (B) 

(1  ♦  5sS)(l  -  *5S)2 


from (A1). Substituting these values into (A2) gives the equilibrium values of the identity coefficients $\Phi_{(AB)(AB)}$, $\Phi_{(A)(A)}$, $\Phi_{(B)(B)}$, $\Gamma_{(AB)(A)(B)}$ and $\Delta_{(A)(B)(A)(B)}$ as shown in Table 3.2, where

$$U = N(1 - \tfrac{1}{2}S)\mu = N_e\mu$$
$$V = N(1 - \tfrac{1}{2}S)\nu = N_e\nu$$
$$R = N(1-S)r = N(1 - \tfrac{1}{2}S)\,\frac{(1-S)r}{1 - \tfrac{1}{2}S} = N_e r_e$$

The equilibrium values of the other identity coefficients are obtained by substituting the equilibrium values from Table 3.2 into (4).


In  order  to  compare  these  results  for  a  partially  selfing 
population  to  the  equivalent  results  for  a  random  mating  population, 
it  is  necessary  to  define  five  further  identity  coefficients.  Three  of 
these  identity  coefficients  involve  choosing  two  chromosomes  at  random 
without  replacement  from  the  population.  One  coefficient  involves 
choosing  three  chromosomes  and  one  coefficient  involves  choosing  four 
chromosomes.  (The  chromosomes  are  not  necessarily  from  different 
individuals). If an arbitrary chromosome is denoted by $a_i b_i$ then the five identity coefficients are

$\Phi_A = P(a_1 \equiv a_2)$
$\Phi_B = P(b_1 \equiv b_2)$
$\Phi_{AB} = P(a_1 \equiv a_2 \text{ and } b_1 \equiv b_2)$
$\Gamma_{AB} = P(a_1 \equiv a_2 \text{ and } b_1 \equiv b_3)$
$\Delta_{AB} = P(a_1 \equiv a_3 \text{ and } b_2 \equiv b_4)$

(Strobeck and Morgan, 1978).

Table 3.2: Expected equilibrium values of the identity coefficients for a finite, partially selfing population. [Table entries, which are rational functions of U, V and R, are not recoverable from this copy.]

In terms of the previous sixteen identity coefficients,


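$$\Phi_A = \frac{1}{2N-1}\Psi_{(A/A)} + \left(1 - \frac{1}{2N-1}\right)\Phi_{(A)(A)} \approx \Phi_{(A)(A)}$$

$$\Phi_B = \frac{1}{2N-1}\Psi_{(B/B)} + \left(1 - \frac{1}{2N-1}\right)\Phi_{(B)(B)} \approx \Phi_{(B)(B)}$$

$$\Phi_{AB} = \frac{1}{2N-1}\Psi_{(AB/AB)} + \left(1 - \frac{1}{2N-1}\right)\Phi_{(AB)(AB)} \approx \Phi_{(AB)(AB)}$$

$$\Gamma_{AB} = \frac{1}{2N-1}\left(\Phi_{(AB)(A/B)} + \Phi_{(AB/A)(B)} + \Phi_{(AB/B)(A)}\right) + \frac{2N-4}{2N-1}\Gamma_{(AB)(A)(B)} \approx \Gamma_{(AB)(A)(B)}$$

$$\Delta_{AB} = \frac{1}{(2N-1)(2N-3)}\left(\Phi_{(A/A)(B/B)} + 2\Phi_{(A/B)(A/B)}\right) + \frac{2N-4}{(2N-1)(2N-3)}\left(\Gamma_{(A/A)(B)(B)} + \Gamma_{(B/B)(A)(A)} + 4\Gamma_{(A/B)(A)(B)}\right) + \frac{(2N-4)(2N-6)}{(2N-1)(2N-3)}\Delta_{(A)(B)(A)(B)} \approx \Delta_{(A)(B)(A)(B)}$$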

if $N \gg 1$. Therefore, the equilibrium values of these identity coefficients are as given in Table 3.2 and are identical to those obtained assuming random mating with a population size $N_e = (1 - \tfrac{1}{2}S)N$ and a recombination value $r_e = (1-S)r/(1 - \tfrac{1}{2}S)$ (Strobeck and Morgan, 1978). Therefore, the effect of partial selfing at equilibrium is to reduce the population size by a factor $1 - \tfrac{1}{2}S$ and the recombination value by a factor $(1-S)/(1 - \tfrac{1}{2}S)$.

There is a simple relationship between these five identity coefficients and the quantities used by Hill (1975) to measure the variation of linkage disequilibrium expected in a finite population (Serant, 1976; Strobeck and Morgan, 1978). If $p_i$ is the frequency of the i-th allele $a_i$ at the A locus, $q_j$ the frequency of the j-th allele $b_j$ at the B locus, and $f_{ij} = p_i q_j + D_{ij}$ the frequency of the chromosome $a_i b_j$, where $D_{ij}$ is the linkage disequilibrium between $a_i$ and $b_j$, then the expected sum of squares of the linkage disequilibria is

$$E\Big(\sum_i\sum_j D_{ij}^2\Big) = \frac{16UV\,[2(U+V)+1]\,[4(U+V)+2R+5]}{(1+4U)(1+4V)\,[32(U+V)^3 + 48(U+V)^2R + 16(U+V)R^2 + 80(U+V)^2 + 76(U+V)R + 8R^2 + 54(U+V) + 26R + 9]}$$

and the squared standard linkage disequilibrium is

$$\sigma_d^2 = \frac{E\big(\sum_i\sum_j D_{ij}^2\big)}{E\big(\sum_{i \neq k}\sum_{j \neq l} p_i p_k q_j q_l\big)} = \frac{4(U+V) + 2R + 5}{16(U+V)^2 + 24(U+V)R + 8R^2 + 32(U+V) + 26R + 11}$$

(Hill, 1975).
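The two measures are straightforward to evaluate numerically. The Python sketch below codes the two formulas above, together with the scaled parameters $U = N_e\mu$, $V = N_e\nu$ and $R = N(1-S)r$ defined earlier. The function names are hypothetical and chosen only for this illustration.

```python
def scaled_parameters(N, S, mu, nu, r):
    """Return (U, V, R) for a partially selfing population.

    U = N_e * mu, V = N_e * nu with N_e = (1 - S/2) N, and R = N (1 - S) r.
    """
    n_e = (1.0 - 0.5 * S) * N
    return n_e * mu, n_e * nu, N * (1.0 - S) * r


def hill_measures(U, V, R):
    """Return (E(sum D_ij^2), sigma_d^2) at equilibrium."""
    x = U + V
    shared = 4 * x + 2 * R + 5
    quadratic = 16 * x**2 + 24 * x * R + 8 * R**2 + 32 * x + 26 * R + 11
    cubic = (32 * x**3 + 48 * x**2 * R + 16 * x * R**2 + 80 * x**2
             + 76 * x * R + 8 * R**2 + 54 * x + 26 * R + 9)
    sum_d2 = 16 * U * V * (2 * x + 1) * shared / ((1 + 4 * U) * (1 + 4 * V) * cubic)
    sigma_d2 = shared / quadratic
    return sum_d2, sigma_d2


if __name__ == "__main__":
    U, V, R = scaled_parameters(N=1000, S=0.5, mu=2.5e-4, nu=2.5e-4, r=1e-3)
    print(hill_measures(U, V, R))
```

As a consistency check on the two expressions, the ratio of the first measure to the second tends to $[4U/(1+4U)][4V/(1+4V)]$ as R becomes large, which is the product of the expected heterozygosities at the two loci.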

In Figures 3.1 and 3.2, the equilibrium values of $E(\sum\sum D_{ij}^2)$ and $\sigma_d^2$ are plotted for $10^{-3} < Nr < 10^{3}$ and with $N\mu = N\nu = 0.25$ and 1.0 and S = 0.0, 0.5, 0.9, 0.99 and 1.0. It is seen that $E(\sum\sum D_{ij}^2)$ and $\sigma_d^2$ remain significantly greater than zero for increasingly larger values of Nr as S approaches one and are not functions of the recombination value if S = 1. If r = 0, $E(\sum\sum D_{ij}^2)$ has a maximum value when $U = V \approx 0.505$, whereas $\sigma_d^2$ is a decreasing function of U+V. Therefore, increasing the proportion of selfing increases the value of $\sigma_d^2$, but may increase or decrease $E(\sum\sum D_{ij}^2)$ when r = 0. Thus, the squared standard linkage disequilibrium is probably the better measure of the variation of linkage disequilibrium in a finite population that is partially selfing.
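The qualitative statements in this paragraph can be reproduced from the equilibrium formulas. The following Python sketch, with hypothetical function names and illustrative parameter values, evaluates $\sigma_d^2$ over the selfing rates used in the figures and locates the maximum of $E(\sum\sum D_{ij}^2)$ along $U = V$ at $R = 0$ by a simple grid search.

```python
def sum_d2(U, V, R):
    """Expected sum of squared linkage disequilibria at equilibrium."""
    x = U + V
    cubic = (32 * x**3 + 48 * x**2 * R + 16 * x * R**2 + 80 * x**2
             + 76 * x * R + 8 * R**2 + 54 * x + 26 * R + 9)
    return (16 * U * V * (2 * x + 1) * (4 * x + 2 * R + 5)
            / ((1 + 4 * U) * (1 + 4 * V) * cubic))


def sigma_d2(U, V, R):
    """Squared standard linkage disequilibrium at equilibrium."""
    x = U + V
    return (4 * x + 2 * R + 5) / (16 * x**2 + 24 * x * R + 8 * R**2
                                  + 32 * x + 26 * R + 11)


def sigma_d2_selfing(N, S, mu, nu, r):
    """sigma_d^2 with U = N_e mu, V = N_e nu, R = N (1 - S) r."""
    n_e = (1.0 - 0.5 * S) * N
    return sigma_d2(n_e * mu, n_e * nu, N * (1.0 - S) * r)


def argmax_sum_d2_no_recombination(step=0.001, upper=2.0):
    """Grid search for the U = V that maximises E(sum D^2) when R = 0."""
    grid = [k * step for k in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda u: sum_d2(u, u, 0.0))


if __name__ == "__main__":
    for S in (0.0, 0.5, 0.9, 0.99, 1.0):
        print(S, sigma_d2_selfing(1000, S, 2.5e-4, 2.5e-4, 1e-3))
    print(argmax_sum_d2_no_recombination())
```

With complete selfing (S = 1) the scaled recombination parameter R is zero whatever the value of r, which is why the curves for S = 1 do not depend on the recombination value.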
Figure 3.1: The expected value of the squared linkage disequilibrium for $N\mu = N\nu = 0.25$ and 1.0 and with partial selfing at a rate S = 0.0, 0.5, 0.9, 0.99, and 1.0. [Plot: $E(\Sigma D^2)$ against $Nr$, from $10^{-3}$ to $10^{3}$, with one curve for each value of S.]
Figure 3.2: The expected value of the squared standard linkage disequilibrium for $N\mu = N\nu = 0.25$ and 1.0 and with partial selfing at a rate S = 0.0, 0.5, 0.9, 0.99, and 1.0 (one curve for each value of S).
Discussion


The results in the previous section show that there is significant variance in the expected linkage disequilibrium due to random drift in a partially selfing population if

$$N_e r_e = N(1-S)r \lesssim 1$$

and the mutation rates $\mu$ and $\nu$ are of the order 1/N. It is, therefore, appropriate to examine the experimental data collected from populations of partially selfing plants to see if the observed linkage disequilibrium can be explained by mutation and random drift. The magnitude of $(1-S)r$ will be used as an indicator of whether the observed linkage disequilibrium could be due to random drift. Since the mutation rate is generally assumed to be between $10^{-4}$ and $10^{-8}$, the population size must be larger than approximately $10^{4}$ if the variation is to be maintained in the population. Therefore, $(1-S)r$ must be less than $10^{-4}$ before the observed linkage disequilibrium is likely to be the result of random drift.


In barley, Hordeum vulgare, Allard and his co-workers (Allard, Kahler and Weir, 1972; Weir, Allard and Kahler, 1972, 1974) found significant linkage disequilibrium between four esterase loci in Composite Cross V. Three loci, A, B and C, are closely linked and the fourth locus is unlinked to the other three. The recombination values between the three linked loci are estimated to be $r_{AB} = 0.0023$, $r_{AC} = 0.0048$ and $r_{BC} = 0.0059$ (Kahler and Allard, 1970). The estimate of the proportion of selfing is S = 0.9943 (Allard, Kahler and Weir, 1972). Therefore, the values of $(1-S)r$ between AB, AC and BC are 0.000013, 0.000027 and 0.000034, respectively. These values are in the range such that linkage disequilibria could be generated by random drift. However, since Composite Cross V was initiated in 1941, a transient analysis is more appropriate than the comparison of the observed sum of squares of the linkage disequilibria or the squared standard linkage disequilibrium to that expected at equilibrium.
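The arithmetic behind the barley comparison is shown in the short Python sketch below. The recombination values and the selfing estimate are those quoted above; the threshold of $10^{-4}$ is the rough criterion stated earlier in this Discussion, and the function name is a hypothetical label for this illustration.

```python
def drift_indicator(S, r):
    """Return (1 - S) * r, the quantity compared with the 1e-4 criterion."""
    return (1.0 - S) * r


# Estimates quoted in the text for Composite Cross V of barley.
BARLEY_SELFING = 0.9943
BARLEY_RECOMBINATION = {"AB": 0.0023, "AC": 0.0048, "BC": 0.0059}
THRESHOLD = 1e-4  # approximate criterion from the text, not a sharp bound

barley_indicators = {
    pair: drift_indicator(BARLEY_SELFING, r)
    for pair, r in BARLEY_RECOMBINATION.items()
}

if __name__ == "__main__":
    for pair, value in barley_indicators.items():
        print(pair, f"{value:.6f}", value < THRESHOLD)
```

All three pairs fall below the criterion, which is the sense in which drift alone could generate the observed disequilibria between these loci.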

Also,  the  linkage  disequilibrium  between  six  loci,  four  esterase 
loci  Ej ,  ,  Eg  and  Ejg,  a  phosphatase  P5 ,  and  an  anodal  peroxidase 
APX5, has been analyzed in Avena barbata, the slender wild oat, by Allard et al. (1972). Three loci, P5, APX5 and E10, are linked, and the recombination values are $r_{P5\text{-}APX5} = 0.04$, $r_{APX5\text{-}E10} = 0.23$ and $r_{P5\text{-}E10} = 0.25$ (Marshall and Allard, 1969). The proportion of selfing has been estimated to be approximately S = 0.98 (Marshall and Allard, 1970; Hamrick and Allard, 1972). Therefore, the smallest value of $(1-S)r$, which is between P5 and APX5, is 0.0008. This value is small enough that random drift might have a significant effect if the size of the effective population is relatively small, but the actual population size was estimated to be approximately 50,000.

These  two  examples  show  that  random  drift  might  explain  some  of 
the  linkage  disequilibrium  observed  in  natural  populations.  However, 
random  drift  is  unlikely  to  be  the  cause  of  the  observed  linkage 
disequilibrium  between  loosely  linked  loci. 


Summary


The variation of linkage disequilibrium expected in a finite, partially selfing population is analyzed, assuming the infinite allele model. Formulas for the expected sum of squares of the linkage disequilibria and the squared standard linkage disequilibrium are derived from the equilibrium values of sixteen identity coefficients required to describe the behavior of the system. These formulas are identical to those obtained with random mating if the effective population size

$$N_e = (1 - \tfrac{1}{2}S)N$$

and the effective recombination value

$$r_e = \frac{(1-S)r}{1 - \tfrac{1}{2}S}$$

where S is the proportion of selfing, are substituted for the population size and the recombination value. Therefore, the effect of partial selfing at equilibrium is to reduce the population size by a factor $1 - \tfrac{1}{2}S$ and the recombination value by a factor $(1-S)/(1 - \tfrac{1}{2}S)$.


Chapter  4 


Increased  Effective  Number  of  Alleles  Found  in  Hybrid 
Populations  due  to  Intragenic  Recombination 


Introduction 


Hybridization is recognized as a common feature of many natural populations and has numerous implications in the study of speciation. One unusual feature of hybrid populations is the presence of alleles which do not exist in either of the parental populations. These unique alleles have been observed by Hunt and Selander (1973) in Mus musculus musculus and M. m. domesticus hybrids and by Sage and Selander (1979) in Rana berlandieri and R. utricularia hybrids. Two explanations have been proposed for their presence: these unique alleles may be due to increased mutation rates in hybrids (Thompson and Woodruff, 1978) or they may be the result of intragenic recombination between different alleles of the parental populations (Watt, 1972). Ohno et al. (1969), McCarron et al. (1974), Freeling (1976), Koehn and Eanes (1976), Morgan and Strobeck (1979), and Tsuno (1981) have observed patterns of variability which they attribute to intragenic recombination.

In  order  to  determine  if  intragenic  recombination  can  explain  the 
presence  of  these  rare  alleles,  we  have  constructed  a  model  to 
determine  the  amount  of  variability  expected  in  a  finite  hybrid 
population.  The  hybrid  population  is  assumed  to  consist  of  individuals 
from each parental population which can mate either with individuals from the same or from the opposite parental population. Therefore, the model used is one with two semi-isolated populations which exchange a proportion of their genes each generation. To allow for the possibility of intragenic recombination, the genes are assumed to consist of two sites or parts. It has been shown that both intragenic recombination and population subdivision can increase the variability maintained in a finite population. Intragenic recombination significantly increases the effective number of alleles whenever $Nr > 1$ and $r > \mu$ (where N is the population size, $\mu$ is the mutation rate to neutral alleles and r is the recombination rate between two sites within the gene) (Strobeck and Morgan, 1978). Subdivision of a population can also increase the effective number of alleles in the total population (of size 2N) because a different group of alleles is maintained in each of the subpopulations, although each subpopulation (of size N) has reduced variability (Malecot, 1948). Nei and Feldman (1972) and Chakraborty and Nei (1974) have also studied gene differentiation and rates of change of homozygosity in a subdivided population (for a review see Felsenstein, 1976; Maruyama, 1977).

It  is  shown  here  that,  at  equilibrium,  the  combination  of 
intragenic  recombination  and  population  subdivision  increases  the 
effective  number  of  alleles  maintained  in  a  population  beyond  the  sum 
of  the  effects  of  each  process  alone.  This  effect  is  greatest  when  the 
recombination  value  is  large  and  hybridization  (migration)  occurs  at 
an  intermediate  rate.  The  transient  behavior  of  the  system  shows  that 
sympatry  of  two  previously  isolated  populations  can  increase  the 
effective  number  of  alleles  maintained  in  each  population  and  in  the 
hybrid population above their equilibrium values for long periods of
time  after  the  beginning  of  hybridization.  These  results  imply  that 
intragenic  recombination  may  be  the  cause  of  the  observed  unique 
alleles  in  hybrid  populations. 


Theory 


Let the population consist of two semi-isolated subpopulations with $N_1$ and $N_2$ diploid individuals. Each generation, the i-th subpopulation receives a proportion $1 - m_i$ of its genes from itself and a proportion $m_i$ from the other subpopulation. The gametic migration considered here is equivalent to individual migration for the parameter values of interest to this study. Each gene is assumed to consist of two sites or parts, denoted a and b, which recombine with a probability of r. Both site a and site b can mutate to unique, selectively neutral forms or "alleles" (as in the infinite alleles model of Kimura and Crow, 1964). Let this mutation rate per gamete per generation be $\nu_1$ for site a and $\nu_2$ for site b. Therefore, the mutation rate of the gene is $\mu = \nu_1 + \nu_2$ per gamete per generation.

Malecot (1948), Maruyama (1970) and Nei and Feldman (1972) showed that the behavior of a single locus in such a subdivided population can be described by three identity coefficients (denoted by $\Psi_{(a)_1(a)_1}$, $\Psi_{(a)_1(a)_2}$ and $\Psi_{(a)_2(a)_2}$). $\Psi_{(a)_i(a)_i}$ is defined as the probability that two genes, chosen randomly without replacement from subpopulation i, are identical (that is, both genes carry the same allele) and $\Psi_{(a)_1(a)_2}$ is defined as the probability that a gene chosen from subpopulation 1 is identical to a gene chosen from subpopulation 2. The recursion relationships for the expected values of $\Psi_{(a)_1(a)_1}$, $\Psi_{(a)_1(a)_2}$ and $\Psi_{(a)_2(a)_2}$ over replicate populations are


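$$
\begin{aligned}
\Psi'_{(a)_1(a)_1} &= (1-\nu_1)^2\Big\{(1-m_1)^2\Big[\frac{1}{2N_1} + \Big(1 - \frac{1}{2N_1}\Big)\Psi_{(a)_1(a)_1}\Big] + 2m_1(1-m_1)\Psi_{(a)_1(a)_2} + m_1^2\Big[\frac{1}{2N_2} + \Big(1 - \frac{1}{2N_2}\Big)\Psi_{(a)_2(a)_2}\Big]\Big\} \\
\Psi'_{(a)_1(a)_2} &= (1-\nu_1)^2\Big\{(1-m_1)(1-m_2)\Psi_{(a)_1(a)_2} + m_1(1-m_2)\Big[\frac{1}{2N_2} + \Big(1 - \frac{1}{2N_2}\Big)\Psi_{(a)_2(a)_2}\Big] + m_2(1-m_1)\Big[\frac{1}{2N_1} + \Big(1 - \frac{1}{2N_1}\Big)\Psi_{(a)_1(a)_1}\Big] + m_1 m_2\Psi_{(a)_1(a)_2}\Big\} \\
\Psi'_{(a)_2(a)_2} &= (1-\nu_1)^2\Big\{(1-m_2)^2\Big[\frac{1}{2N_2} + \Big(1 - \frac{1}{2N_2}\Big)\Psi_{(a)_2(a)_2}\Big] + 2m_2(1-m_2)\Psi_{(a)_1(a)_2} + m_2^2\Big[\frac{1}{2N_1} + \Big(1 - \frac{1}{2N_1}\Big)\Psi_{(a)_1(a)_1}\Big]\Big\}
\end{aligned}
$$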


(where the ' indicates the value of the coefficient in the next generation). These same relationships hold true for a single site within a gene and those for a second site ($\Psi_{(b)_1(b)_1}$, $\Psi_{(b)_1(b)_2}$ and $\Psi_{(b)_2(b)_2}$) are also the same but with $\nu_1$ replaced by $\nu_2$.
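The single-site recursions for two subpopulations can be iterated directly to obtain equilibrium identities. The Python sketch below does this under the assumption that the mutation term is $(1-\nu)^2$, that is, neither of the two sampled genes has mutated. The function names and the parameter values are illustrative assumptions.

```python
def step(psi11, psi12, psi22, N1, N2, m1, m2, nu):
    """Advance the three single-site identity coefficients one generation."""
    # Identity of two genes drawn (with replacement) from the gametes of
    # one subpopulation: the same parental gene, or two different genes.
    same1 = 1.0 / (2 * N1) + (1.0 - 1.0 / (2 * N1)) * psi11
    same2 = 1.0 / (2 * N2) + (1.0 - 1.0 / (2 * N2)) * psi22
    keep = (1.0 - nu) ** 2  # neither gene mutates
    new11 = keep * ((1 - m1) ** 2 * same1 + 2 * m1 * (1 - m1) * psi12
                    + m1 ** 2 * same2)
    new12 = keep * ((1 - m1) * (1 - m2) * psi12 + m1 * (1 - m2) * same2
                    + m2 * (1 - m1) * same1 + m1 * m2 * psi12)
    new22 = keep * ((1 - m2) ** 2 * same2 + 2 * m2 * (1 - m2) * psi12
                    + m2 ** 2 * same1)
    return new11, new12, new22


def equilibrium(N1, N2, m1, m2, nu, generations=20_000):
    """Iterate from zero identity until (approximately) at equilibrium."""
    psi = (0.0, 0.0, 0.0)
    for _ in range(generations):
        psi = step(*psi, N1, N2, m1, m2, nu)
    return psi


if __name__ == "__main__":
    for m in (0.0, 0.001, 0.01, 0.5):
        print(m, equilibrium(100, 100, m, m, 0.005))
```

Two limiting cases give a check on the coding: with no migration each subpopulation behaves as an isolated population of size N, and with $m_1 = m_2 = 0.5$ the two subpopulations behave as a single population of size 2N.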


This  approach,  using  identity  coefficients,  can  be  extended  to 
consider  two  linked  sites  (within  a  gene)  in  two  subpopulations. 
Although  genes  actually  consist  of  many  sites  the  consideration  of  two 
sites  is  an  appropriate  model  to  indicate  if  intragenic  recombination 
has any significant effect. The widespread occurrence of introns in
eukaryotic  genes  (Gilbert,  1978;  Crick,  1979)  facilitates 
recombination  between  exons  and  this  model  is  a  fairly  accurate 
representation  of  a  gene  with  a  single  intron  (the  sites  being 
identified  with  the  exons) . 


To describe the behavior of this system from one generation to the next requires 26 identity coefficients (each the probability that a particular sample of genes, picked at random without replacement, are identical at the a and/or b sites). The symbol $\Psi$ is used to designate the probability that two gametes have identical a (or b) sites. The symbols $\Phi$, $\Gamma$ and $\Delta$ are used to designate probabilities of identity at both the a and b sites chosen from two, three and four gametes respectively. Each symbol is subscripted to indicate how the gametes are chosen. The coefficient $\Phi_{(ab)_1(ab)_2}$, for example, is defined as

$$\Phi_{(ab)_1(ab)_2} = \mathrm{Prob}(a_{i1} \equiv a_{j2} \text{ and } b_{i1} \equiv b_{j2})$$

where an arbitrary gamete chosen from the first subpopulation is denoted by $a_{i1}b_{i1}$, an arbitrary gamete chosen from the second subpopulation is denoted by $a_{j2}b_{j2}$ and where "$\equiv$" should be read "is identical to". The definitions of the 26 coefficients are given in Table 4.1.

Complete recursion relationships for these 26 coefficients have been derived and if $N_i \gg 1$, $\nu_i, r, m_i = O(1/N_i)$ and terms of $O(1/N_i^2)$ are neglected the recursion relationships for the expected values of the coefficients over replicate populations simplify to the form given in Appendix 4. Equations similar to those given here were developed by Serant (1974) with $N_1 = N_2$, although no analysis of the equations was presented.


Chakraborty and Nei (1974) found that the equilibrium and transient values of homozygosity for a single locus with two populations of size $N_1$ and $N_2$ differ little from those with two populations each of size $N = (N_1+N_2)/2$. We also find that solutions to the equations are not changed qualitatively when it is assumed that $N_1 = N_2 = N$, $\nu_1 = \nu_2 = \nu$ and $m_1 = m_2 = m$. These assumptions greatly simplify the equations at equilibrium since not all 26 coefficients are then required. The number of necessary coefficients is reduced to 11 since

$$
\begin{aligned}
\Psi_{(a)_1(a)_1} &= \Psi_{(a)_2(a)_2} = \Psi_{(b)_1(b)_1} = \Psi_{(b)_2(b)_2} \\
\Psi_{(a)_1(a)_2} &= \Psi_{(b)_1(b)_2} \\
\Phi_{(ab)_1(ab)_1} &= \Phi_{(ab)_2(ab)_2} \\
\Gamma_{(ab)_1(a)_1(b)_1} &= \Gamma_{(ab)_2(a)_2(b)_2} \\
\Gamma_{(ab)_1(a)_1(b)_2} &= \Gamma_{(ab)_1(a)_2(b)_1} = \Gamma_{(ab)_2(a)_2(b)_1} = \Gamma_{(ab)_2(a)_1(b)_2} \\
\Gamma_{(ab)_1(a)_2(b)_2} &= \Gamma_{(ab)_2(a)_1(b)_1} \\
\Delta_{(a)_1(b)_1(a)_1(b)_1} &= \Delta_{(a)_2(b)_2(a)_2(b)_2} \\
\Delta_{(a)_1(b)_1(a)_1(b)_2} &= \Delta_{(a)_1(b)_1(a)_2(b)_1} = \Delta_{(a)_1(b)_2(a)_2(b)_2} = \Delta_{(a)_2(b)_1(a)_2(b)_2} \\
\Delta_{(a)_1(b)_2(a)_1(b)_2} &= \Delta_{(a)_2(b)_1(a)_2(b)_1}
\end{aligned}
$$

Eleven  coefficients  were  used  to  determine  the  equilibrium  values  of 
the  identity  coefficients.  If  the  coefficients  initially  satisfy  the 
above  equalities  and  the  above  assumptions  are  met  then  the 
coefficients  will  satisfy  these  equalities  for  all  time.  Therefore, 
only  these  11  coefficients  were  also  used  to  study  the  transient 
behavior.  The  equilibrium  identity  coefficients  were  found  by  solving 
the  system  of  11  linear  equations  with  particular  values  of  the 
mutation,  migration  and  recombination  rates.  The  values  of  these 
parameters  were  chosen  to  illustrate  the  qualitative  form  of  the 
results  that  will  be  obtained.  The  transient  results  were  obtained  by 


iteration. 




Results  and  Discussion 


The coefficient $\Phi_{(ab)_1(ab)_1}$ is the expected homozygosity in subpopulation 1 of a gene consisting of two sites (each with a mutation rate $\nu$). The expected homozygosity in subpopulation 2 ($\Phi_{(ab)_2(ab)_2}$) is the same as in subpopulation 1 because of the assumptions of equal mutation rates, migration rates and population sizes. The effective number of alleles, a measure of variability, within each subpopulation is therefore $n_e = 1/\Phi_{(ab)_1(ab)_1}$. If the subdivision is known to an observer, $n_e$ is the appropriate measure of variability in a subdivided population. However, in many cases the population may be spatially or ethologically subdivided and not recognized as such by an observer. In this case genes would be sampled randomly from each group and the expected homozygosity would be

$$\frac{N_1(N_1-1)}{(N_1+N_2)(N_1+N_2-1)}\Phi_{(ab)_1(ab)_1} + \frac{2N_1N_2}{(N_1+N_2)(N_1+N_2-1)}\Phi_{(ab)_1(ab)_2} + \frac{N_2(N_2-1)}{(N_1+N_2)(N_1+N_2-1)}\Phi_{(ab)_2(ab)_2}$$

and the appropriate effective number of alleles would be

$$n_e^* = \frac{1}{\dfrac{N_1(N_1-1)}{(N_1+N_2)(N_1+N_2-1)}\Phi_{(ab)_1(ab)_1} + \dfrac{2N_1N_2}{(N_1+N_2)(N_1+N_2-1)}\Phi_{(ab)_1(ab)_2} + \dfrac{N_2(N_2-1)}{(N_1+N_2)(N_1+N_2-1)}\Phi_{(ab)_2(ab)_2}} \approx \frac{1}{\dfrac{N_1^2}{(N_1+N_2)^2}\Phi_{(ab)_1(ab)_1} + \dfrac{2N_1N_2}{(N_1+N_2)^2}\Phi_{(ab)_1(ab)_2} + \dfrac{N_2^2}{(N_1+N_2)^2}\Phi_{(ab)_2(ab)_2}}$$

since $N_1, N_2 \gg 1$. With the above assumptions

$$n_e^* = \frac{1}{\tfrac{1}{2}\Phi_{(ab)_1(ab)_1} + \tfrac{1}{2}\Phi_{(ab)_1(ab)_2}}$$

In a hybrid population between two races or species the genes would be sampled at random from each group and therefore $n_e^*$ should be used as the measure of variability.
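As an illustration of the pooled measure, the following Python sketch evaluates $n_e^*$ both with the sampling weights written above and with the large-N approximation. The function names and the numerical identity coefficients used in the example are hypothetical.

```python
def ne_star(phi11, phi12, phi22, N1, N2):
    """Effective number of alleles when genes are sampled from the pooled
    population, using the sampling weights given in the text."""
    total = (N1 + N2) * (N1 + N2 - 1)
    homozygosity = (N1 * (N1 - 1) / total * phi11
                    + 2 * N1 * N2 / total * phi12
                    + N2 * (N2 - 1) / total * phi22)
    return 1.0 / homozygosity


def ne_star_large_n(phi11, phi12, phi22, N1, N2):
    """Same quantity with the approximation N1, N2 >> 1."""
    total = (N1 + N2) ** 2
    homozygosity = (N1 ** 2 / total * phi11
                    + 2 * N1 * N2 / total * phi12
                    + N2 ** 2 / total * phi22)
    return 1.0 / homozygosity


if __name__ == "__main__":
    # Illustrative values only: equal subpopulation sizes, so the
    # approximation reduces to 1 / (phi11 / 2 + phi12 / 2).
    print(ne_star(0.2, 0.1, 0.2, 1000, 1000))
    print(ne_star_large_n(0.2, 0.1, 0.2, 1000, 1000))
```

With equal subpopulation sizes and $\Phi_{(ab)_1(ab)_1} = \Phi_{(ab)_2(ab)_2}$, the approximate form reduces to the symmetric expression for $n_e^*$ given above.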

The effective number of alleles at equilibrium in a single subpopulation, $n_e$, is given in Figure 4.1 for $4N\mu = 2.0$, $10^{-3} < 4Nr < 10^{3}$ and $10^{-3} < 4Nm < 10^{3}$. It shows that migration (hybridization) and recombination each significantly increase the number of alleles maintained in a population when $4Nm > 1$ or $4Nr > 1$. When both migration and recombination occur together the effective number of alleles is increased beyond the sum of the increases due to each process alone. This is because recombination requires initial variability to be present before it can generate more variability. The migration introduces new alleles which can then recombine with other alleles. It can be seen that the change in $n_e$ between low and high amounts of both migration and recombination is very large (an $n_e$ of 3 versus 9). Hence, the effective number of alleles in a natural population can be very large due to just these two processes, when both recombination and migration rates are sufficiently large. When the mutation rate is smaller than $4N\mu = 2.0$, the graphs show the same behavior but to a lesser degree. From studies on three sites it can be anticipated that as more sites are added to a gene the variability further increases when both migration and recombination occur.


Figure 4.1: The equilibrium effective number of alleles maintained in each of the two subpopulations ($4N\mu = 2.0$).
In a hybrid population the effective number of alleles at equilibrium is $n_e^*$. This quantity (plotted in Fig. 4.2 for $10^{-3} < 4Nm < 10^{3}$, $10^{-3} < 4Nr < 10^{3}$ and $4N\mu = 2.0$ (2a) or $4N\mu = 1.0$ (2b)) is quite different from the effective number of alleles in one subpopulation. Again, smaller mutation rates show the same relative effects (as in Fig. 4.2b) but the effect is smaller in magnitude. There is an optimum when recombination is large and hybridization occurs at an intermediate rate. Since each subpopulation maintains a different array of alleles, hybridization can introduce new alleles to one population which can then recombine to create new variability. These recombinants are combinations which do not exist in either parental population and would appear as alleles unique to the hybrid population. Without recombination the variability introduced by hybridization is already present in the total population and does not increase $n_e^*$. Intragenic recombination can, therefore, account for the unique alleles seen in hybrid populations and perhaps for some of the mutants found in hybrid dysgenesis (Thompson and Woodruff, 1978).

The  decrease  on  either  side  of  the  optimum  as  the  hybridization 
rate  changes  is  the  result  of  several  factors.  First,  new  combinations 
will be created only if one of the a and b sites which recombine is
of  a  form,  or  "allele",  which  does  not  exist  in  one  of  the 
subpopulations.  Therefore,  as  the  rate  of  hybridization  increases  and 
the  similarity  of  alleles  from  each  subpopulation  increases,  the 
chance  that  recombinants  will  be  new  alleles  decreases.  On  the  other 
hand,  a  very  small  rate  of  hybridization  does  not  introduce  sufficient 
numbers  of  genes  for  intragenic  recombination  to  be  effective. 


Figure 4.2: The equilibrium effective number of alleles maintained when gametes are sampled at random from each subpopulation (4.2a $4N\mu = 2.0$, 4.2b $4N\mu = 1.0$).
It has been shown by Malecot (1948) that the effective number of
alleles in the entire population increases as the migration rates
between the subpopulations decrease. As shown in Figure 4.2 this is
not always true when intragenic recombination occurs. In a population
with no recombination and no migration between subpopulations,
$n_e^* = 2 + 8N\mu$, while in a population with large migration rates,
$n_e^* = 1 + 8N\mu$. Therefore, for fixed $4N\mu$ and no recombination, a
population with no migration between subpopulations always has more
variability than with free migration. However, when the recombination
value is high,


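$$n_e^* = \left[\frac{1}{4}\left(\frac{1}{1+2N\mu}\right)^2 + \frac{1}{2}(0) + \frac{1}{4}\left(\frac{1}{1+2N\mu}\right)^2\right]^{-1} = 2 + 8N\mu + 8N^2\mu^2$$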


in  a  population  with  no  migration  between  subpopulations.  In  a 
population  with  free  migration 


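$$n_e^* = \left[\frac{1}{4}\left(\frac{1}{1+4N\mu}\right)^2 + \frac{1}{2}\left(\frac{1}{1+4N\mu}\right)^2 + \frac{1}{4}\left(\frac{1}{1+4N\mu}\right)^2\right]^{-1} = 1 + 8N\mu + 16N^2\mu^2 .$$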


Thus,  when  the  recombination  value  is  high,  a  population  with  free 
migration  between  subpopulations  will  have  more  variability  when 
$4N\mu > \sqrt{2} \approx 1.41$. This difference is because the amount of variability
intragenic  recombination  can  create  is  a  non-linear,  increasing 
function  of  the  variability  already  present. 
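
The comparison between the two high-recombination limits can be checked numerically. The following is a minimal sketch, assuming only the two limiting expressions quoted above written in terms of theta = 4N(mu); the function names are illustrative and are not taken from the original analysis.

```python
import math

def n_e_total_isolated(theta):
    """High recombination, no migration: 2 + 8N*mu + 8N^2*mu^2, with theta = 4*N*mu."""
    return 2.0 * (1.0 + theta / 2.0) ** 2

def n_e_total_free_migration(theta):
    """High recombination, free migration: 1 + 8N*mu + 16N^2*mu^2, with theta = 4*N*mu."""
    return (1.0 + theta) ** 2

def crossover_theta():
    """Setting the two expressions equal gives theta^2 / 2 = 1, so theta = sqrt(2)."""
    return math.sqrt(2.0)

if __name__ == "__main__":
    for theta in (0.5, 1.0, crossover_theta(), 2.0):
        print(theta, n_e_total_isolated(theta), n_e_total_free_migration(theta))
```

Below the crossover the isolated subpopulations maintain more alleles; above it the freely migrating population does, because the quadratic term grows faster when the two subpopulations act as one.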


It has been suggested that some hybrid populations may be stable
over long periods of time (Mayr, 1963, pp. 368-379; Short, 1972; Hunt
and Selander, 1973). In this case the equilibrium analysis above is
appropriate. However, if the hybrid population is relatively recent it
is necessary to consider the transient behavior. In order to
investigate how the effective number of alleles changes over time in a
non-equilibrium population, it is appropriate to assume that the two
subpopulations initially are at equilibrium with no hybridization, i.e.,
the equilibrium values of the coefficients when $m_1 = m_2 = 0$.
Hybridization, at a constant rate, is then introduced and the change
in the value of the coefficients over time is followed (again mutation
rates, migration rates and population sizes are assumed to be equal
for each of the subpopulations). Figure 4.3 shows the results with $n_e$
plotted for $N = 10^4$, $4N\mu = 2.0$, $4Nr = 0.1, 1.0, 10.0$ and $4Nm = 10.0$.
The abscissa gives the number of generations, starting at generation 0
(each subpopulation at equilibrium with no hybridization) and running
up to generation 10,000. The equilibrium that will eventually be
reached is indicated by an arrow. For all values of $4Nr$, the transient
value of $n_e$ (the effective number of alleles within one subpopulation)
shows an increase above the eventual equilibrium as hybridization and
recombination introduce new alleles to the subpopulation and
thereafter a slow decline as random drift eliminates alleles. With
increasing values of $4Nr$ the difference between the maximum $n_e$ and the
eventual equilibrium becomes larger. In all cases the equilibrium is
above the initial value and the transient $n_e$ becomes higher still.

Figure 4.3: The effective number of alleles in each subpopulation
after two isolated populations begin hybridization with each
other ($4N\mu = 2.0$, $4Nm = 10.0$, $N = 10^4$).

Figure 4.4 gives the results for $n_e^*$ for the same parameter values
as Figure 4.3. For a small $4Nr$, $n_e^*$ goes straight toward a lower
equilibrium value (though very slowly); whereas with a large $4Nr$, $n_e^*$
overshoots its equilibrium value which is larger than the initial
value. Since $n_e^*$ is a measure of the total population variability, if
the two subpopulations were to instantaneously mix there would be no
change in $n_e^*$. Therefore, initially the only new variability in the
total population is created by recombination. Without this
recombination, new alleles would be formed only by mutation.

Figures 4.3 and 4.4 show that the effects of past events will be
retained in a population for a very large number of generations. This
makes the utility of an equilibrium analysis questionable, because the
approach to such an equilibrium is very slow. For example, when
$4Nm = 1.0$ and $4Nr = 10.0$, $n_e^*$ is still increasing away from the
equilibrium even after 10,000 generations have passed. Similarly, $n_e$
must exceed the equilibrium before returning to it many generations
later; however, when $4Nm = 1.0$ and $4N\mu = 2.0$, $n_e$ has not yet increased
up to the equilibrium after 10,000 generations. Other hybridizations
or events of importance to the population are certain to occur before
an equilibrium is reached. Also the effects of hybridization can
actually be larger in a recent hybrid population than those expected
in an equilibrium population. In general $n_e$ increases above the
equilibrium, though the length of time is prohibitive when $4Nm$ is
small, and $n_e^*$ also increases above the equilibrium when $4Nr$ is large.
It may be difficult to observe the initial effects of hybridization on
population parameters because $n_e^*$ changes slowly. From Figure 4.4 it
can be seen that $n_e^*$ initially changes asymptotically. Indeed, although
the potential for large changes in $n_e^*$ is present, little difference
would be noticeable after 125 generations.


This  study  indicates  that  homozygosity  is  not  an  effective  way  to 
determine  if  intragenic  recombination  is  an  important  factor  in 
creating  new  alleles  in  a  hybrid  population.  This  determination  might, 
however,  be  done  at  the  molecular  level  by  sequencing  the  DNA.  If  one 
part  of  the  gene  had  a  sequence  characteristic  of  one  subpopulation 
and  another  region  of  the  gene  was  characteristic  of  the  other 
subpopulation  then  intragenic  recombination  would  be  indicated. 


Summary 


A  two  site,  infinite  allele  model  is  used  to  study  the  influence 
of  intragenic  recombination  on  the  effective  number  of  neutral  alleles 
in  a  hybrid  population.  It  is  shown  that  the  combination  of  intragenic 
recombination  and  hybridization  can  have  a  large  effect  on  the 
effective  number  of  alleles  in  a  population  at  equilibrium  and  an  even 
larger effect when the population is not at equilibrium. When the
mutation  and  recombination  rates  are  large,  a  completely  subdivided 
population  will  not  maintain  as  much  variability  as  a  random  mating 
population.  It  is  concluded  that  unique  alleles  in  hybrid  populations 
could  be  formed  by  intragenic  recombination. 


Chapter  5 

Variance  and  Covariance  of  Homozygosity  in  a  Structured  Population 


Introduction


The  amount  of  homozygosity  is  a  basic  measure  of  variability  in  a 
natural  population.  The  expected  homozygosity  in  a  finite  population 
with  selectively  neutral  alleles  was  first  determined  by  Haldane 
(1939)  and  independently  by  Malecot  (1948)  and  by  Kimura  and  Crow 
(1964).  To  interpret  observed  levels  of  homozygosity  it  is  also 
necessary to know the expected variance. The variance of homozygosity
in  a  finite  population  with  mutation  and  with  selectively  neutral 
alleles  was  determined  by  Watterson  (1974)  and  by  Stewart  (1976),  and 
the  transient  behavior  by  Li  and  Nei  (1975). 

A  method  to  derive  the  variance  of  homozygosity  is  developed 
here,  using  identity  coefficients.  The  method  is  applied  to  derive  the 
variance  and  covariance  of  homozygosity  for  a  structured  population. 
The  population  is  assumed  to  be  divided  into  n  partially  isolated 
subpopulations  each  with  N  diploid,  randomly  mating  individuals.  The 
variance  of  homozygosity  for  completely  isolated  populations 
(including its decomposition into component parts) has been derived
by Lessard (1981). The variance of homozygosity within each
subpopulation,  the  variance  of  homozygosity  when  gametes  are  sampled 
at  random  from  the  subpopulations  and  the  covariance  of  homozygosity 
between  two  subpopulations  are  examined.  The  results  are  compared  with 
those expected for a single random mating population.


Variance  of  Homozygosity  in  a  Single  Population 


The  variance  of  homozygosity  is  first  derived  for  a  single 
population  and  the  derivation  is  then  extended  to  a  structured 
population. 

Consider  a  locus  (denoted  A)  which  can  have  k  possible  alleles. 
Let  the  population  consist  of  N  randomly  mating  diploid  individuals. 
Each generation alleles can mutate to any particular allele at a rate
$\mu/(k-1)$ per gamete per generation. Thus the total mutation rate of an
allele to any other allele is $\mu$ per gamete per generation.


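When $N \gg 1$, the expected "homozygosity" is

$$E\left(\sum_{i=1}^{k} p_i^2\right)$$

where $p_i$ is the frequency of the i-th allele and the variance of
homozygosity is

$$\mathrm{Var}\left(\sum_{i=1}^{k} p_i^2\right) = E\left[\left(\sum_{i=1}^{k} p_i^2\right)^2\right] - \left[E\left(\sum_{i=1}^{k} p_i^2\right)\right]^2$$

$$= E\left(\sum_{i=1}^{k} p_i^4\right) + E\left(\sum_{i=1}^{k}\sum_{\substack{j=1 \\ j \ne i}}^{k} p_i^2 p_j^2\right) - \left[E\left(\sum_{i=1}^{k} p_i^2\right)\right]^2 .$$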


When $N \gg 1$ and neglecting terms of order 1/N, the expected
homozygosity is equal to the probability that two gametes, sampled at
random without replacement, have the same allele at locus A (Kimura
and Crow, 1964). This probability can be denoted $\Phi_{11}$. The terms in the
formula for the variance of homozygosity can also be expressed as
identity coefficients. The term $E(\sum_{i=1}^{k} p_i^4)$ is approximately the
probability that four gametes sampled without replacement carry
alleles at locus A which are identical in state and can be denoted
$\Delta_{1111}$. The term $E(\sum_{i}\sum_{j \ne i} p_i^2 p_j^2)$ is approximately the probability
that, of four gametes sampled without replacement, the gametes within
each of two pairs carry the same allele but the two pairs carry
different alleles. This probability can be denoted $\Delta_{11/11}$. Thus the
variance of homozygosity can be expressed as

$$\mathrm{Var}\left(\sum_{i=1}^{k} p_i^2\right) = \Delta_{1111} + \Delta_{11/11} - \Phi_{11}^2 .$$


To  determine  recursion  relationships  for  these  coefficients  it  is 
convenient  to  define  another  six  identity  coefficients.  Throughout, 
the coefficients are denoted by $\Phi$ if they involve a sample of two
gametes, by $\Gamma$ if they involve a sample of three gametes and by $\Delta$ if
they involve a sample of four gametes. A slash is used to separate
non-identical  genes;  and  subscripts  denote  the  subpopulation  from 
which  the  alleles  were  chosen.  Denoting  the  allele  on  an  arbitrary 
gamete  by  a_^,  the  nine  coefficients  can  be  defined  as  in  Table  5.1. 

Recursion  relationships  for  the  expected  values  of  the 
coefficients  over  replicate  populations  are  given  in  Appendix  4. 
Although  these  nine  coefficients  help  to  make  the  derivation  more 
obvious,  only  four  of  them  are  necessary  since 


$$\Gamma_{111} + 3\Gamma_{11/1} + \Gamma_{1/1/1} = 1$$

$$\Delta_{1111} + 4\Delta_{111/1} + 3\Delta_{11/11} + 6\Delta_{11/1/1} + \Delta_{1/1/1/1} = 1$$

$$\Gamma_{111} + \Gamma_{11/1} = \Phi_{11}$$

$$\Delta_{1111} + \Delta_{111/1} = \Gamma_{111}$$

$$\Delta_{111/1} + \Delta_{11/11} + \Delta_{11/1/1} = \Gamma_{11/1}$$


When $1/N, \mu \ll 1$ and using these identities, the recursion
relationships reduce to those shown in Table 5.2. Therefore at
equilibrium, with $A = 4N\mu/(k-1)$,

$$\Phi_{11} = \frac{1+A}{1+kA}$$

$$\Gamma_{111} = \frac{(1+A)(2+A)}{(1+kA)(2+kA)}$$

$$\Delta_{1111} = \frac{(1+A)(2+A)(3+A)}{(1+kA)(2+kA)(3+kA)}$$

$$\Delta_{11/11} = \frac{A(1+A)^2(k-1)}{(1+kA)(2+kA)(3+kA)}$$

$$\mathrm{Var}\left(\sum p_i^2\right) = \frac{2A(1+A)(k-1)}{(1+kA)^2(2+kA)(3+kA)}$$



This  is  the  same  result  obtained  by  Watterson  (1974)  and  Stewart 
(1976)  . 
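
The equilibrium expressions can be checked against one another numerically. The sketch below assumes the equilibrium coefficients and the closed-form variance as given above, with theta = 4N(mu) and A = theta/(k-1); the function names are illustrative, and the infinite-allele case is handled by taking the limits A -> 0 and kA -> theta.

```python
import math

def equilibrium_coefficients(theta, k):
    """Return (Phi_11, Gamma_111, Delta_1111, Delta_11/11) at equilibrium.

    theta = 4*N*mu; k = number of possible alleles (math.inf for infinite alleles).
    """
    A = 0.0 if math.isinf(k) else theta / (k - 1)
    kA = theta if math.isinf(k) else k * A
    phi = (1 + A) / (1 + kA)
    gamma = phi * (2 + A) / (2 + kA)
    delta_1111 = gamma * (3 + A) / (3 + kA)
    # A*(k-1) equals theta for every k, including the infinite-allele limit.
    delta_11_11 = theta * (1 + A) ** 2 / ((1 + kA) * (2 + kA) * (3 + kA))
    return phi, gamma, delta_1111, delta_11_11

def variance_closed_form(theta, k):
    """Closed-form equilibrium variance of homozygosity."""
    A = 0.0 if math.isinf(k) else theta / (k - 1)
    kA = theta if math.isinf(k) else k * A
    return 2 * theta * (1 + A) / ((1 + kA) ** 2 * (2 + kA) * (3 + kA))

def variance_from_coefficients(theta, k):
    """Variance assembled as Delta_1111 + Delta_11/11 - Phi_11^2."""
    phi, _, d1111, d11_11 = equilibrium_coefficients(theta, k)
    return d1111 + d11_11 - phi ** 2

if __name__ == "__main__":
    for k in (2, 4, 20, math.inf):
        print(k, variance_from_coefficients(0.5, k), variance_closed_form(0.5, k))
```

The two routes agree to rounding error, which confirms that the closed form is the algebraic simplification of the identity-coefficient expression.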


The variance has a maximum value, for particular values of $4N\mu$
and k, which results in several interesting properties. When $k \to \infty$ a
maximum variance of 0.0508 is obtained when

$$4N\mu = \frac{2\sqrt{67}}{9}\cos\left\{\frac{1}{3}\arccos\left(\frac{17^2}{67^{3/2}}\right)\right\} - \frac{11}{9} \approx 0.493 .$$

When k = 4 the maximum variance is 0.0432 with $4N\mu = 0.439$. The
maximum variance and the corresponding $4N\mu$ are both increasing
functions of k.
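
The location of the maximum can be reproduced numerically. The sketch below is an illustration under stated assumptions: for infinitely many alleles, setting the derivative of the equilibrium variance to zero gives the cubic 3x^3 + 11x^2 + 6x - 6 = 0 in x = 4N(mu) (this cubic is derived here, not quoted from the source), and its positive root is written in trigonometric form. For k = 4 a simple golden-section search is used instead. All function names are illustrative.

```python
import math

def variance_of_homozygosity(theta, k):
    """Equilibrium variance; theta = 4*N*mu, k = number of alleles (math.inf allowed)."""
    A = 0.0 if math.isinf(k) else theta / (k - 1)
    kA = theta if math.isinf(k) else k * A
    return 2 * theta * (1 + A) / ((1 + kA) ** 2 * (2 + kA) * (3 + kA))

def theta_at_maximum_infinite_alleles():
    """Positive root of 3x^3 + 11x^2 + 6x - 6 = 0 in trigonometric form."""
    angle = math.acos(17 ** 2 / 67 ** 1.5) / 3
    return 2 * math.sqrt(67) / 9 * math.cos(angle) - 11 / 9

def argmax_golden(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    ratio = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if f(c) > f(d):
            hi = d   # the maximum lies in [lo, d]
        else:
            lo = c   # the maximum lies in [c, hi]
    return (lo + hi) / 2

if __name__ == "__main__":
    theta_inf = theta_at_maximum_infinite_alleles()
    theta_4 = argmax_golden(lambda x: variance_of_homozygosity(x, 4), 0.01, 5.0)
    print(theta_inf, variance_of_homozygosity(theta_inf, math.inf))
    print(theta_4, variance_of_homozygosity(theta_4, 4))
```

The search interval (0.01, 5.0) is an arbitrary choice that brackets the maximum for the values of k considered here.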
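As shown by Li and Nei (1975) the system of equations can be
solved at generation t and the variance is given by

$$\mathrm{Var}^{(t)}\left(\sum_{i=1}^{k} p_i^2\right) = \mathrm{Var}^{(\infty)}\left(\sum_{i=1}^{k} p_i^2\right) - \lambda_1^{t}\left[2\Phi_{11}^{(\infty)}\left(\Phi_{11}^{(0)}-\Phi_{11}^{(\infty)}\right) - \alpha_1\right] - \lambda_1^{2t}\left(\Phi_{11}^{(0)}-\Phi_{11}^{(\infty)}\right)^2$$

$$\qquad + \lambda_2^{t}\alpha_2 + \lambda_3^{t}\left(\Delta_{1111}^{(0)}+\Delta_{11/11}^{(0)}-\Delta_{1111}^{(\infty)}-\Delta_{11/11}^{(\infty)}-\alpha_1-\alpha_2\right)$$

where

$$\alpha_1 = \frac{2\left(\Phi_{11}^{(0)}-\Phi_{11}^{(\infty)}\right)\left(16+10A+kA+kA^2\right)}{(4+kA)(5+kA)}$$

$$\alpha_2 = \frac{8\left[\Gamma_{111}^{(0)}-\Gamma_{111}^{(\infty)} - 3\left(\Phi_{11}^{(0)}-\Phi_{11}^{(\infty)}\right)(2+A)/(4+kA)\right]}{6+kA}$$

$$\lambda_1 = 1 - \frac{1}{2N} - \frac{2k\mu}{k-1}$$

$$\lambda_2 = 1 - \frac{3}{2N} - \frac{3k\mu}{k-1}$$

$$\lambda_3 = 1 - \frac{3}{N} - \frac{4k\mu}{k-1}$$
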


In  Figure  5.1  the  approach  to  equilibrium  is  shown,  starting  with  a 
completely  homozygous  population.  As  time  proceeds,  variability 
increases within the population. When $4N\mu < 0.5$, the variance of
homozygosity quickly increases and then asymptotically approaches its
equilibrium value. When $4N\mu > 0.5$, the variance increases up to and
then beyond its equilibrium value. This is because the variance is
maximum when $4N\mu \approx 0.5$. The increase in the variance is large when $4N\mu$
is large.
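
The approach to equilibrium from a completely homozygous population can be sketched numerically. The code below is a hedged illustration, not the original computation: it assumes the moment recursions are linear with per-generation decay factors 1 - 1/(2N) - 2k(mu)/(k-1), 1 - 3/(2N) - 3k(mu)/(k-1) and 1 - 3/N - 4k(mu)/(k-1) for samples of two, three and four gametes, and it uses the equilibrium coefficients given earlier. The function and argument names are illustrative.

```python
import math

def transient_variance(t, N, mu, k, phi0=1.0, gamma0=1.0, s0=1.0):
    """Variance of homozygosity after t generations.

    phi0 and gamma0 are the initial Phi_11 and Gamma_111; s0 is the initial
    Delta_1111 + Delta_11/11. The defaults describe a homozygous population.
    """
    theta = 4 * N * mu
    if math.isinf(k):
        A, kA, u = 0.0, theta, mu            # limits of A, k*A and k*mu/(k-1)
    else:
        A = theta / (k - 1)
        kA = k * A
        u = k * mu / (k - 1)
    phi_eq = (1 + A) / (1 + kA)
    gamma_eq = phi_eq * (2 + A) / (2 + kA)
    s_eq = (1 + A) * (6 + 4 * A + kA + kA * A) / ((1 + kA) * (2 + kA) * (3 + kA))
    a = phi0 - phi_eq
    alpha1 = 2 * a * (16 + 10 * A + kA + kA * A) / ((4 + kA) * (5 + kA))
    alpha2 = 8 * (gamma0 - gamma_eq - 3 * a * (2 + A) / (4 + kA)) / (6 + kA)
    lam1 = (1 - 1 / (2 * N) - 2 * u) ** t
    lam2 = (1 - 3 / (2 * N) - 3 * u) ** t
    lam3 = (1 - 3 / N - 4 * u) ** t
    s_t = s_eq + lam1 * alpha1 + lam2 * alpha2 + lam3 * (s0 - s_eq - alpha1 - alpha2)
    phi_t = phi_eq + lam1 * a
    return s_t - phi_t ** 2

if __name__ == "__main__":
    N = 10 ** 4
    for theta in (0.1, 2.0):
        mu = theta / (4 * N)
        print(theta, [round(transient_variance(t, N, mu, math.inf), 5)
                      for t in (0, 2000, 8000, 40000, 10 ** 7)])
```

With these assumptions the coefficient of the slowest-decaying term changes sign at 4N(mu) = 0.5 in the infinite-allele case, so the variance overshoots its equilibrium only above that value, in line with the behavior described for Figure 5.1.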


Figure 5.1: Variance of homozygosity in a single population over
time, starting with a completely homozygous population.


Variance  of  homozygosity  for  a  structured  population 


To determine the variance of homozygosity within subdivided
populations consider n subpopulations each with N diploid individuals.
Each generation a proportion m of the gametes in each subpopulation
are migrants chosen at random from the remaining n-1 subpopulations.
At equilibrium (or with initial conditions such that the probability
of identity/non-identity is independent of the numeration of the
subpopulations, e.g., a given coefficient takes the same value for all
$i \ne j \ne k$) this model requires a minimum of 17 coefficients. These
coefficients are defined in Table 5.3, where $a_i$ is an arbitrary gamete
chosen from the i-th subpopulation. When $1/N, \mu, m \ll 1$, recursion
relationships for the expected values of the coefficients over
replicate populations can be found and are given in Table 5.4. An
analytical solution would be difficult to find for this system of
equations. Therefore, the equations were solved numerically by
substituting particular values for the subpopulation size, mutation
rate and migration rate.


The variance of homozygosity within the i-th subpopulation is

$$\mathrm{Var}_i = \Delta_{iiii} + \Delta_{ii/ii} - \Phi_{ii}^2$$

and the covariance of homozygosity between the i-th and j-th
subpopulations is

$$\mathrm{Cov}_{ij} = \Delta_{iijj} + \Delta_{ii/jj} - \Phi_{ii}^2 .$$


Table 5.4: Approximate recursion relationships for the expected values of the identity
coefficients over replicate populations. A single locus in a structured population.

When gametes are sampled at random from the subpopulations the
apparent homozygosity is approximately

$$\frac{1}{n}\Phi_{ii} + \frac{n-1}{n}\Phi_{ij}$$

and the variance of homozygosity is approximately

$$\frac{1}{n^3}\Big[\Delta_{iiii} + 4(n-1)\Delta_{iiij} + 3(n-1)\Delta_{iijj} + 6(n-1)(n-2)\Delta_{iijk} + (n-1)(n-2)(n-3)\Delta_{ijk\ell}$$

$$\quad + \Delta_{ii/ii} + 4(n-1)\Delta_{ii/ij} + (n-1)\Delta_{ii/jj} + 2(n-1)\Delta_{ij/ij} + 2(n-1)(n-2)\Delta_{ii/jk} + 4(n-1)(n-2)\Delta_{ij/ik}$$

$$\quad + (n-1)(n-2)(n-3)\Delta_{ij/k\ell}\Big] - \left[\frac{1}{n}\Phi_{ii} + \frac{n-1}{n}\Phi_{ij}\right]^2 .$$
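
The multiplicities attached to each coefficient, such as 3(n-1) and 6(n-1)(n-2), count the ways four randomly sampled gametes can be distributed over the n subpopulations. The sketch below checks this counting by brute-force enumeration. It is an illustration under the assumption that the first gamete is fixed in one subpopulation by symmetry, leaving n^3 equally likely arrangements; the labels such as "iijk" and "ij/ik" are shorthand for subscript patterns and the function names are illustrative.

```python
from collections import Counter
from itertools import product

def closed_form_weights(n):
    """Multiplicities for each subscript pattern, as functions of n."""
    a, b, c = n - 1, (n - 1) * (n - 2), (n - 1) * (n - 2) * (n - 3)
    identical = {"iiii": 1, "iiij": 4 * a, "iijj": 3 * a, "iijk": 6 * b, "ijkl": c}
    paired = {"ii/ii": 1, "ii/ij": 4 * a, "ii/jj": a, "ij/ij": 2 * a,
              "ii/jk": 2 * b, "ij/ik": 4 * b, "ij/kl": c}
    return identical, paired

def enumerated_weights(n):
    """Count the patterns directly; gametes (0,1) and (2,3) form the two pairs."""
    identical, paired = Counter(), Counter()
    for rest in product(range(n), repeat=3):
        s = (0,) + rest                      # first gamete fixed by symmetry
        sizes = sorted(Counter(s).values(), reverse=True)
        within_pair = s[0] == s[1] or s[2] == s[3]
        if sizes == [4]:
            identical["iiii"] += 1
            paired["ii/ii"] += 1
        elif sizes == [3, 1]:
            identical["iiij"] += 1
            paired["ii/ij"] += 1
        elif sizes == [2, 2]:
            identical["iijj"] += 1
            paired["ii/jj" if within_pair else "ij/ij"] += 1
        elif sizes == [2, 1, 1]:
            identical["iijk"] += 1
            paired["ii/jk" if within_pair else "ij/ik"] += 1
        else:
            identical["ijkl"] += 1
            paired["ij/kl"] += 1
    return identical, paired

if __name__ == "__main__":
    print(closed_form_weights(4))
    print(enumerated_weights(4))
```

In both families the multiplicities sum to n^3, which is the normalizing factor in the variance expression.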


The following results are for $k \to \infty$ or k = 4 and n = 4. The results
for other values of k and n are similar unless stated otherwise.

The variance of homozygosity within one of four subpopulations
for $k \to \infty$ and for k = 4 is given in Figure 5.2 with $10^{-1} \le 4N\mu \le 10^{0}$
and $10^{-3} < 4Nm < 10^{2}$. The results illustrate the effect of the maximum
variance when $4N\mu \approx 0.5$. Depending on the amount of migration the
equilibrium  variance  will  be  higher  or  lower  than  that  expected  in  a 
single  population.  This  is  because  migrants  can  carry  new  alleles  with 
them,  augmenting  the  mutation  rate  so  that  it  is  closer  to  or  exceeds 
the  maximum.  When  k  =  4  (as  is  appropriate  for  a  single  nucleotide) , 
the  results  remain  qualitatively  the  same.  When  migration  occurs 
between  a  larger  number  of  subpopulations,  the  migrants  are  more 
likely  to  carry  different  alleles  and  thus  the  variance  changes  faster 
as  the  migration  rate  changes. 


Figure 5.2: Equilibrium variance of homozygosity within one of
four subpopulations (5.2a $k \to \infty$, 5.2b k = 4). [Axes: 4Nm; variance.]


When the subdivision is unknown to an observer or a hybrid
population is considered, the expected variance of homozygosity is
given by sampling gametes at random from each subpopulation. The
result of this is shown in Figure 5.3 with a) $k \to \infty$, b) k = 4 and with
$10^{-1} < 4N\mu < 10^{0}$ and $10^{-3} < 4Nm < 10^{2}$. When the migration rate is
small, the results change as k changes. This is because the
probability of picking identical alleles from two subpopulations is
zero when $k \to \infty$ and m = 0, but this probability is 1/4 when k = 4.
Thus, the variance of homozygosity is small when $k \to \infty$ but remains
relatively large when k = 4. This effect is more dramatic when n < 4.

To  determine  the  transient  behavior  of  the  variance  of 
homozygosity,  each  subpopulation  is  assumed  to  be  initially  at 

equilibrium  with  m  =  0.  Migration,  at  a  constant  rate,  is  then 
introduced  and  the  change  in  the  value  of  the  coefficients  over  time 
is  followed  by  iterating  the  equations  in  Table  5.4.  This  is  done  for 
$4Nm = 10.0$ and $k \to \infty$ and the results are shown in Figure 5.4. The
equilibrium  that  will  eventually  be  reached  is  indicated  by  an  arrow. 
Within  a  single  subpopulation  (Figure  5.4a)  there  is  a  large  and  rapid 
decrease  in  the  variance  and  then  a  slow  increase  back  to  equilibrium. 
Lower  migration  rates  cause  slower  rates  of  change  and  higher 
migration  rates  cause  faster  rates  of  change,  but  the  results  remain 

similar. Note that the time scale is very large in these graphs. For
example, if $N = 10^4$ and a generation length of 20 years for man is
assumed, the time scale covers more than half a million years.

Nevertheless, the value of the variance of homozygosity at time t = 0
(when $4N\mu = 0.125$) is closer to its equilibrium value than at time
t = 3N. When gametes are chosen at random from the subpopulations, the
results shown in Figure 5.4b are obtained. In this case there is not a
drastic decrease in the variance. In general there is a slow,
monotonic increase in the variance toward the new equilibrium. Again
when $4N\mu$ is small it takes a large number of generations to approach
the equilibrium. When k and n are small, the variance of homozygosity,
picking gametes at random from the subpopulations, does not decrease
over time. In this case the initial variance of homozygosity is close
to the equilibrium value and only minor changes occur. The variance of
homozygosity within a subpopulation also shows smaller changes when k
is small.

Figure 5.3: Equilibrium variance of homozygosity sampling at
random from four subpopulations (5.3a $k \to \infty$, 5.3b k = 4). [Axis: 4Nm.]

Figure 5.4: Transient variance of homozygosity within one of four
subpopulations (5.4a, $k \to \infty$) and the transient variance of
homozygosity sampling at random from four subpopulations (5.4b,
$k \to \infty$). [Axis: variance.]

Figure 5.5 shows the correlation coefficient for homozygosity
between two subpopulations with $10^{-1} < 4N\mu < 10^{0}$ and $10^{-3} < 4Nm < 10^{2}$.
The correlation coefficient is almost identical whether $k \to \infty$ (5.5a)
or k = 4 (5.5b). Malecot (1948) showed that when $4Nm > 1$, the alleles
of two subpopulations are very similar ($\Phi_{ii} \approx \Phi_{ij}$). Figure 5.5 shows
that this is also true for the second moment of frequencies for
several subpopulations and with an infinite or finite number of
alleles. Over a short range of 4Nm, the subpopulations change from
being unrelated to strongly correlated. Presumably this is also true
for all gene frequency moments.

Figure 5.5: Equilibrium correlation coefficient of homozygosity
between two of four subpopulations (5.5a $k \to \infty$, 5.5b k = 4).
[Axis: correlation coefficient.]

Most natural populations are subdivided in some way and this
creates problems for many statistical tests. For example Ewens' (1972)
method for estimating $4N\mu$ considers only a single population. To
extend the method for a structured population would be difficult. It
is, therefore, necessary to know how strongly the subdivision affects
the variance of homozygosity (and higher moments) relative to that
expected in a single population. To do this we have used the
homozygosity within a subpopulation to calculate the appropriate value
of $4N\mu$ for a single population, from the relation
$$4N\mu = \frac{1 - \Phi_{ii}}{\dfrac{k}{k-1}\Phi_{ii} - \dfrac{1}{k-1}}$$


This value of $4N\mu$ was then substituted into Stewart's (1976) formula
to give an expected variance of homozygosity. The ratio of the true
variance to this expected value is given in Figure 5.6. As can be
seen, this ratio is close to one for all $4N\mu$ and $4Nm$, both for
$k \to \infty$ and for k = 4. The maximum and minimum of the ratio are 1.039
and 0.881 in Figure 5.6a and 1.051 and 0.986 in Figure 5.6b,
respectively. When n = 2, the ratio is even closer to one. If the
behavior of higher order moments is reflected by that of the variance,
this suggests that many statistical tests may be appropriately applied
to a subpopulation which has migration with other subpopulations.
Similar results were found in the simulations of Ewens and Gillespie
(1974) and Slatkin (1982).
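
The comparison described here can be sketched in a few lines. The code below is an illustration only: it assumes the single-population relations Phi = (1+A)/(1+kA) with A = 4N(mu)/(k-1) and the closed-form equilibrium variance, inverts the first to obtain the apparent 4N(mu), and forms the ratio of a supplied variance to the single-population expectation. The function names and the numerical inputs in the usage lines are hypothetical.

```python
import math

def theta_from_homozygosity(phi, k):
    """Apparent 4*N*mu implied by a homozygosity phi in a single population."""
    if math.isinf(k):
        return (1 - phi) / phi
    return (1 - phi) / (k / (k - 1) * phi - 1 / (k - 1))

def expected_variance(theta, k):
    """Single-population equilibrium variance of homozygosity."""
    A = 0.0 if math.isinf(k) else theta / (k - 1)
    kA = theta if math.isinf(k) else k * A
    return 2 * theta * (1 + A) / ((1 + kA) ** 2 * (2 + kA) * (3 + kA))

def variance_ratio(true_variance, phi, k):
    """Ratio of an observed or computed variance to the single-population value."""
    return true_variance / expected_variance(theta_from_homozygosity(phi, k), k)

if __name__ == "__main__":
    # Hypothetical inputs: homozygosity 0.6 and variance 0.04 within a subpopulation.
    print(theta_from_homozygosity(0.6, 4))
    print(variance_ratio(0.04, 0.6, 4))
```

A ratio near one indicates that the single-population theory describes the variance adequately for the given level of homozygosity.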

If however the subdivision is unknown to an observer, this is no
longer true. Figure 5.7 shows the ratio of the true variance to the
expected variance when genes are sampled at random from four
subpopulations. When k is large, the actual variance of homozygosity
is much smaller than the variance appropriate for the expected level
of homozygosity. Therefore, an observer must know the subdivisions of
the population under study. The simulation studies of Ewens and
Gillespie (1974) suggested that population subdivision, with larger
migration rates, does not invalidate the use of Ewens' theory. The
simulation of Slatkin (1982) suggests the opposite when the migration
rates are small. As noted by Slatkin this is due to the difference in
the migration rates. From Figure 5.7a we can quantify the range of
4Nm where this change takes place as $10^{0} < 4Nm < 10^{1}$. Figure 5.7b
shows that the ratio is again close to one when k is small. The
maximum and minimum of the ratio are 1.057 and 0.061 in Figure 5.7a
and 1.139 and 0.995 in Figure 5.7b, respectively. Thus even if a
population is subdivided, estimates of the mutation rates of
nucleotides (e.g., using Ewens' 1974 method) may be appropriate.

Figure 5.6: Ratio of the actual variance of homozygosity within
one of four subpopulations to the expected variance for a single
population with the same variability (5.6a $k \to \infty$, 5.6b k = 4).

Figure 5.7: Ratio of the actual variance of homozygosity sampling
gametes at random from four subpopulations to the expected
variance for a single population with the same variability (5.7a
$k \to \infty$, 5.7b k = 4). [Axis: ratio.]


Summary 

The  variance  of  homozygosity  for  a  k-allele  model  with  n 
partially  isolated  subpopulations  is  derived  numerically  using 
identity  coefficients.  Within  a  single  population  the  variance  has  a 
maximum  of  approximately  0.05.  Thus  the  transient  variance  may 
increase and then decrease over time when $4N\mu > 0.5$. This maximum also
causes  the  variance  within  a  subpopulation  to  depend  strongly  upon  the 
migration  rates  with  other  subpopulations.  The  variance  is  not 
strongly  influenced  by  the  number  of  alleles  possible  at  a  locus 
unless  the  population  is  presumed  panmictic,  but  is  actually 
subdivided.  When  the  latter  is  true,  the  variance  is  higher  with  small 
migration  rates  when  k  is  small.  The  transient  behavior  of  the 
variance  of  homozygosity  shows  that  a  large  number  of  generations  may 
be  required  to  approach  equilibrium  values.  The  results  suggest  that, 
in  many  situations,  the  variance  of  homozygosity  may  be  adequately 
estimated from the amount of variability present. If the results for
higher  order  moments  are  similar,  statistical  tests  need  not  consider 
structured  populations  as  a  special  case. 


Chapter  6 


Two-Locus,  Fourth  Order  Gene  Frequency  Moments: 
Implications  for  the  Variance  of  Squared  Linkage  Disequilibrium  and 

the  Variance  of  Homozygosity 


Introduction


The  partial  differential  equations  for  a  diffusion  approximation 
which  describe  the  behavior  of  two,  linked,  neutral  loci  in  a  finite 
population  have  been  known  for  a  long  time  (Kimura,  1955).  However, 
the  equations  are  too  complicated  to  be  easily  solved.  One  way  to 
circumvent  this  problem  is  to  evaluate  only  the  moments  of  gene 
frequencies.  It  is  known  (eg:  Hill  and  Robertson,  1968;  Serant  and 
Villard,  1972;  Weir  and  Cockerham  1974;  Serant,  1974;  Hill,  1975;  Li 
and  Nei,  1975;  Strobeck  and  Morgan,  1978;  and  others)  that  these 
moments  follow  simple  recursion  relationships  which  can  be  solved. 
This  approach  is  used  here  to  evaluate  the  two-locus,  fourth  order 
gene frequency moments. The results are applied to two problems: the
variance of the expected squared linkage disequilibrium and the
variance of homozygosity of a gene with recombination between two
sites.

Linkage disequilibrium is a measure of nonrandom association
between alleles at different loci. It can be defined as

$$D_{ij} = f_{ij} - p_i q_j$$

where $f_{ij}$ is the frequency of gametes with the i-th allele at locus A
and j-th allele at locus B and where $p_i$ and $q_j$ are the corresponding
allele frequencies. Both selection and random drift due to a finite
population size can cause a nonrandom association. The sum of squares
of the linkage disequilibria,

$$\sum_i\sum_j D_{ij}^2 = \sum_i\sum_j \left(f_{ij} - p_i q_j\right)^2 ,$$

is a measure of the average disequilibria between two loci. It is a
component of the squared correlation of gene frequencies, enters into
the standard Chi square test and is of interest itself. Therefore, it
is necessary to know the size of the variance that can be expected in
a natural population. The results presented here demonstrate that the
standard deviation of $\sum_i\sum_j D_{ij}^2$ will usually be larger than the mean and
can be much larger for realistic mutation rates.
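
As a small worked illustration of the definition, the sketch below computes the disequilibria and their sum of squares from a table of gamete frequencies. The function names and the example frequency tables are illustrative and are not taken from the original text.

```python
def linkage_disequilibria(f):
    """Return the matrix D[i][j] = f[i][j] - p[i] * q[j].

    f[i][j] is the frequency of gametes carrying allele i at locus A and
    allele j at locus B; the entries of f are assumed to sum to one.
    """
    p = [sum(row) for row in f]              # allele frequencies at locus A
    q = [sum(col) for col in zip(*f)]        # allele frequencies at locus B
    return [[f[i][j] - p[i] * q[j] for j in range(len(q))]
            for i in range(len(p))]

def sum_squared_disequilibria(f):
    """Sum over all allele pairs of the squared linkage disequilibrium."""
    return sum(d * d for row in linkage_disequilibria(f) for d in row)

if __name__ == "__main__":
    complete_association = [[0.5, 0.0], [0.0, 0.5]]
    random_association = [[0.06, 0.14], [0.24, 0.56]]
    print(sum_squared_disequilibria(complete_association))
    print(sum_squared_disequilibria(random_association))
```

With two equally frequent alleles at each locus and complete association every disequilibrium has absolute value 1/4, so the sum of squares is 1/4; with random association it is zero.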


The  method  can  also  be  used  to  find  the  variance  of  homozygosity 
for  a  gene  consisting  of  two  sites  between  which  recombination  occurs. 
It  is  known  that  recombination  occurs  within  genes  and  the  discovery 
that  introns  are  prevalent  in  most  eukaryotic  genes  facilitates  such 
recombination. It has been shown by Monte Carlo simulation (Strobeck
and Morgan, 1978) that intragenic recombination significantly
increases the variance of homozygosity if $4N\mu > 1.0$ and $r > \mu$. The
results confirm and extend this prediction.


Theory 


Consider two loci (denoted A and B) in a finite population with
2N gametes. The gametes in each generation are produced following a
Wright-Fisher model (Ewens, 1979). Let the mutation rate to unique,
selectively neutral alleles be $\nu_1$ and $\nu_2$ per gamete per generation, at
loci A and B, respectively. Let the recombination rate between the two
loci be r. Throughout, it is assumed that $1/2N, \nu_1, \nu_2, r \ll 1$ and
terms of order $(1/2N)^2$, $\nu_1^2$, $\nu_2^2$, $r^2$ or higher are neglected.


To define the necessary system of equations to find fourth order
moments requires a minimum of 50 identity coefficients, each the
probability that a particular sample of gametes have (or do not have)
identical alleles. These coefficients can be denoted

$$\Phi_{ijk/\ell mn/p/q}$$

To define these coefficients, consider a sample of $i+j+k+\ell+m+n+p+q$
gametes drawn at random, without replacement (a group of i gametes, a
group of j gametes and so on). Signify the gametes in each of these
groups with superscript roman numerals, e.g., let $a_x^{I}$ (or $b_x^{I}$) denote the
allele at locus A (or locus B) from the x-th gamete of the first group
of i gametes and $a_x^{II}$ denote the allele at locus A from the x-th gamete
of the second group of j gametes, etc. Define

$$\Phi_{ijk/\ell mn/p/q} = \mathrm{Prob}\big[\, a_1^{I} \equiv \cdots \equiv a_i^{I} \equiv a_1^{II} \equiv \cdots \equiv a_j^{II} \equiv a_1^{VII} \equiv \cdots \equiv a_p^{VII}$$

$$\qquad \not\equiv a_1^{IV} \equiv \cdots \equiv a_\ell^{IV} \equiv a_1^{V} \equiv \cdots \equiv a_m^{V} \equiv a_1^{VIII} \equiv \cdots \equiv a_q^{VIII}$$

$$\qquad \text{and } b_1^{I} \equiv \cdots \equiv b_i^{I} \equiv b_1^{III} \equiv \cdots \equiv b_k^{III} \equiv b_1^{VIII} \equiv \cdots \equiv b_q^{VIII}$$

$$\qquad \not\equiv b_1^{IV} \equiv \cdots \equiv b_\ell^{IV} \equiv b_1^{VI} \equiv \cdots \equiv b_n^{VI} \equiv b_1^{VII} \equiv \cdots \equiv b_p^{VII} \,\big]$$

where "$\equiv$" should be read "is identical to" and "$\not\equiv$"
should be read "are not identical to". These coefficients are defined
diagrammatically in Figure 6.1. In words, $\Phi_{ijk/\ell mn/p/q}$ is
the probability that (i) all genes at locus A carried by the i+j+p
gametes are identical; (ii) all genes at locus A carried by the
$\ell$+m+q gametes are identical; (iii) the alleles in (i) and (ii)
are different; (iv) all genes at locus B carried by the i+k+q gametes
are identical; (v) all genes at locus B carried by the $\ell$+n+p
gametes are identical; and (vi) the alleles in (iv) and (v) differ.
Note that these coefficients apply to a specific sample of gametes.
For example, the probability that any one of three genes differs is
$3\Phi_{020/010/0/0}$ while $\Phi_{020/010/0/0}$ is the probability
that a particular gene differs. In general (with i, j, k, $\ell$, m,
n, p and q gametes in each group), the number of ways gametes can be
sampled is


$$\frac{(i+j+k+\ell+m+n+p+q)!}{i!\,j!\,k!\,\ell!\,m!\,n!\,p!\,q!\,x}$$

where, if $i=\ell=p=q=0$ then

x=4 if $j=m\neq 0$ and $k=n\neq 0$
x=2 if $j=m\neq 0$ or $k=n\neq 0$
x=1 otherwise

if i, $\ell$, p or q $\neq 0$ then

x=4 if $i=\ell=p=q$ and $j=m$ and $k=n$
x=2 if $i=q$ and $\ell=p$ and $j=m$
x=2 if $i=p$ and $\ell=q$ and $k=n$
x=2 if $i=\ell$ and $p=q$ and $j=m$ and $k=n$
x=1 otherwise

Figure 6.1: Diagrammatic definition of the identity coefficients.
(All of the alleles in a vertical column must be identical, a slash
separates non-identical alleles and the letters indicate the number of
gametes sampled.)
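
The counting rule above can be sketched in Python. This is a minimal illustration, not part of the original derivation: the function names `symmetry_factor` and `sampling_ways` are hypothetical, the case list is read as "first matching case wins", and the three x=2 cases for non-empty two-locus groups are treated as alternatives.

```python
from fractions import Fraction
from math import factorial


def symmetry_factor(i, j, k, l, m, n, p, q):
    """Divisor x from the case list in the text (first matching case wins)."""
    if i == l == p == q == 0:
        a_symmetric = (j == m != 0)   # j = m and both non-zero
        b_symmetric = (k == n != 0)   # k = n and both non-zero
        if a_symmetric and b_symmetric:
            return 4
        if a_symmetric or b_symmetric:
            return 2
        return 1
    if i == l == p == q and j == m and k == n:
        return 4
    if ((i == q and l == p and j == m)
            or (i == p and l == q and k == n)
            or (i == l and p == q and j == m and k == n)):
        return 2
    return 1


def sampling_ways(i, j, k, l, m, n, p, q):
    """Multinomial count of group assignments divided by the symmetry factor x."""
    groups = (i, j, k, l, m, n, p, q)
    denominator = symmetry_factor(*groups)
    for size in groups:
        denominator *= factorial(size)
    # A Fraction is returned so that a non-integer result would be visible
    # rather than silently truncated.
    return Fraction(factorial(sum(groups)), denominator)


# Two identical genes and one different gene at locus A: three arrangements.
print(sampling_ways(0, 2, 0, 0, 1, 0, 0, 0))
```

For the three-gene example in the text the function returns 3, matching the statement that the probability that any one of three genes differs is three times the coefficient for a particular gene.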
For convenience, let

$$\Phi_{ijk/000/0/0} = \Phi_{ijk}$$


The  general  recursion  relationship  for  the  expected  values  of  the 
coefficients  over  replicate  populations  is  derived  in  Appendix  5.  The 
set  of  equations,  necessary  to  determine  the  variance  of  linkage 
disequilibrium,  is  given  in  Appendix  6.  Particular  values  for  the 
recombination  and  mutation  rate  were  substituted  into  these  equations. 
The  equations  were  then  numerically  solved  on  a  computer  to  determine 
the  equilibrium  values  of  the  identity  coefficients. 


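The variance of squared linkage disequilibrium can be expressed as

$$
\begin{aligned}
\mathrm{Var}\Big[\sum_i\sum_j D_{ij}^2\Big]
={}& E\Big[\Big(\sum_i\sum_j D_{ij}^2\Big)^2\Big] - E\Big[\sum_i\sum_j D_{ij}^2\Big]^2 \\
={}& E\Big[\sum_i\sum_j\sum_k\sum_\ell \big(f_{ij}^2 f_{k\ell}^2 - 2 f_{ij}^2 f_{k\ell}\,p_k q_\ell + f_{ij}^2 p_k^2 q_\ell^2 - 2 f_{ij} f_{k\ell}^2\,p_i q_j + 4 f_{ij} f_{k\ell}\,p_i p_k q_j q_\ell \\
& \qquad - 2 f_{ij}\,p_i p_k^2 q_j q_\ell^2 + f_{k\ell}^2 p_i^2 q_j^2 - 2 f_{k\ell}\,p_i^2 p_k q_j^2 q_\ell + p_i^2 p_k^2 q_j^2 q_\ell^2\big)\Big] \\
& - E\Big[\sum_i\sum_j \big(f_{ij}^2 - 2 f_{ij} p_i q_j + p_i^2 q_j^2\big)\Big]^2 \\
={}& E\Big[\sum_i\sum_j \big(f_{ij}^4 - 4 f_{ij}^3 p_i q_j + 6 f_{ij}^2 p_i^2 q_j^2 - 4 f_{ij} p_i^3 q_j^3 + p_i^4 q_j^4\big)\Big] \\
& + E\Big[\sum_i\sum_{k\neq i}\sum_j \big(f_{ij}^2 f_{kj}^2 - 2 f_{ij}^2 f_{kj}\,p_k q_j + f_{ij}^2 p_k^2 q_j^2 - 2 f_{ij} f_{kj}^2\,p_i q_j + 4 f_{ij} f_{kj}\,p_i p_k q_j^2 \\
& \qquad - 2 f_{ij}\,p_i p_k^2 q_j^3 + f_{kj}^2 p_i^2 q_j^2 - 2 f_{kj}\,p_i^2 p_k q_j^3 + p_i^2 p_k^2 q_j^4\big)\Big] \\
& + E\Big[\sum_i\sum_j\sum_{\ell\neq j} \big(f_{ij}^2 f_{i\ell}^2 - 2 f_{ij}^2 f_{i\ell}\,p_i q_\ell + f_{ij}^2 p_i^2 q_\ell^2 - 2 f_{ij} f_{i\ell}^2\,p_i q_j + 4 f_{ij} f_{i\ell}\,p_i^2 q_j q_\ell \\
& \qquad - 2 f_{ij}\,p_i^3 q_j q_\ell^2 + f_{i\ell}^2 p_i^2 q_j^2 - 2 f_{i\ell}\,p_i^3 q_j^2 q_\ell + p_i^4 q_j^2 q_\ell^2\big)\Big] \\
& + E\Big[\sum_i\sum_{k\neq i}\sum_j\sum_{\ell\neq j} \big(f_{ij}^2 f_{k\ell}^2 - 2 f_{ij}^2 f_{k\ell}\,p_k q_\ell + f_{ij}^2 p_k^2 q_\ell^2 - 2 f_{ij} f_{k\ell}^2\,p_i q_j + 4 f_{ij} f_{k\ell}\,p_i p_k q_j q_\ell \\
& \qquad - 2 f_{ij}\,p_i p_k^2 q_j q_\ell^2 + f_{k\ell}^2 p_i^2 q_j^2 - 2 f_{k\ell}\,p_i^2 p_k q_j^2 q_\ell + p_i^2 p_k^2 q_j^2 q_\ell^2\big)\Big] \\
& - E\Big[\sum_i\sum_j \big(f_{ij}^2 - 2 f_{ij} p_i q_j + p_i^2 q_j^2\big)\Big]^2
\end{aligned}
$$

where the sums on i and k extend over all alleles at locus A and the
sums on j and $\ell$ over all alleles at locus B. Each of these gene
frequency moments is equivalent to an identity coefficient. In
particular, writing a and c for distinct alleles at locus A and b and
d for distinct alleles at locus B,

$$\Phi_{ijk/\ell mn/p/q} = E\Big[\sum_a\sum_{c\neq a}\sum_b\sum_{d\neq b} f_{ab}^{\,i}\,p_a^{\,j}\,q_b^{\,k}\,f_{cd}^{\,\ell}\,p_c^{\,m}\,q_d^{\,n}\,f_{ad}^{\,p}\,f_{cb}^{\,q}\Big]$$

where a sum is omitted when the corresponding allele does not appear
in the coefficient (for example,
$\Phi_{ijk} = E[\sum_a\sum_b f_{ab}^{\,i}\,p_a^{\,j}\,q_b^{\,k}]$).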

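Therefore, the variance can be expressed as

$$
\begin{aligned}
\mathrm{Var}\Big[\sum_i\sum_j D_{ij}^2\Big] ={}& \Phi_{400} - 4\Phi_{311} + 6\Phi_{222} - 4\Phi_{133} + \Phi_{044} \\
& + \Phi_{200/000/0/2} - 4\Phi_{201/010/0/1} + 2\Phi_{202/020/0/0} + 4\Phi_{112/010/0/1} - 4\Phi_{113/020/0/0} + \Phi_{024/020/0/0} \\
& + \Phi_{200/000/2/0} - 4\Phi_{210/001/1/0} + 2\Phi_{220/002/0/0} + 4\Phi_{121/001/1/0} - 4\Phi_{131/002/0/0} + \Phi_{042/002/0/0} \\
& + \Phi_{200/200/0/0} - 4\Phi_{200/111/0/0} + 2\Phi_{200/022/0/0} + 4\Phi_{111/111/0/0} - 4\Phi_{111/022/0/0} + \Phi_{022/022/0/0} \\
& - \big(\Phi_{200} - 2\Phi_{111} + \Phi_{022}\big)^2
\end{aligned}
$$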

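Similarly, for a gene with two sites the variance of homozygosity is

$$
\begin{aligned}
\mathrm{Var}\Big[\sum_i\sum_j f_{ij}^2\Big]
={}& E\Big[\Big(\sum_i\sum_j f_{ij}^2\Big)^2\Big] - E\Big[\sum_i\sum_j f_{ij}^2\Big]^2 \\
={}& E\Big[\sum_i\sum_j f_{ij}^4 + \sum_i\sum_{k\neq i}\sum_j f_{ij}^2 f_{kj}^2 + \sum_i\sum_j\sum_{\ell\neq j} f_{ij}^2 f_{i\ell}^2 + \sum_i\sum_{k\neq i}\sum_j\sum_{\ell\neq j} f_{ij}^2 f_{k\ell}^2\Big] \\
& - E\Big[\sum_i\sum_j f_{ij}^2\Big]^2
\end{aligned}
$$

and in terms of identity coefficients it is

$$\Phi_{400} + \Phi_{200/000/0/2} + \Phi_{200/000/2/0} + \Phi_{200/200/0/0} - \Phi_{200}^2$$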
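
The four groups of identity coefficients in both variance expressions correspond to a partition of the squared double sum according to whether the two gametes share an A allele, a B allele, both, or neither. The following Python sketch checks that partition with exact arithmetic for a single, made-up table of gamete counts. The function name and the counts are illustrative assumptions only; the sketch does not evaluate any expectation over replicate populations.

```python
from fractions import Fraction
from itertools import product


def partition_squared_sum(counts):
    """Split (sum D_ij^2)^2 and (sum f_ij^2)^2 into the four allele-sharing classes."""
    total = sum(sum(row) for row in counts)
    n_a, n_b = len(counts), len(counts[0])
    f = [[Fraction(c, total) for c in row] for row in counts]    # gamete frequencies
    p = [sum(row) for row in f]                                   # allele frequencies, locus A
    q = [sum(f[a][b] for a in range(n_a)) for b in range(n_b)]    # allele frequencies, locus B
    d2 = [[(f[a][b] - p[a] * q[b]) ** 2 for b in range(n_b)] for a in range(n_a)]
    f2 = [[f[a][b] ** 2 for b in range(n_b)] for a in range(n_a)]

    classes = ("same A, same B", "diff A, same B", "same A, diff B", "diff A, diff B")
    ld_parts = dict.fromkeys(classes, Fraction(0))
    hom_parts = dict.fromkeys(classes, Fraction(0))
    for a, b, c, d in product(range(n_a), range(n_b), range(n_a), range(n_b)):
        key = ("same A" if a == c else "diff A") + ", " + ("same B" if b == d else "diff B")
        ld_parts[key] += d2[a][b] * d2[c][d]      # term of (sum D^2)^2
        hom_parts[key] += f2[a][b] * f2[c][d]     # term of (sum f^2)^2
    return d2, f2, ld_parts, hom_parts


counts = [[5, 1, 2], [0, 3, 4]]   # hypothetical counts: 2 alleles at A, 3 at B
d2, f2, ld_parts, hom_parts = partition_squared_sum(counts)
ld_total = sum(map(sum, d2)) ** 2
hom_total = sum(map(sum, f2)) ** 2
print(ld_total == sum(ld_parts.values()))
print(hom_total == sum(hom_parts.values()))
```

Taking expectations of each class separately is what produces the four lines of identity coefficients: the "same A, same B" class gives the single-group coefficients, and the remaining three classes give the coefficients with non-empty groups VIII, VII and IV respectively.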
Results & Discussion

Variance of squared linkage disequilibrium

Significant  levels  of  linkage  disequilibrium  are  found  frequently 
in  partially  and  completely  selfing  populations  and  between  alleles 
associated  with  inversions.  In  other  cases,  extensive  surveys  of 
natural populations (Lewontin, 1974; Nevo, 1978; Brown, 1979)
generally  show  only  low  levels  of  linkage  disequilibrium.  However,  the 
variance  of  the  expected  squared  linkage  disequilibrium  is  not  known. 

Hill (1977) attempted to determine the coefficient of variation
(C.V.) for the squared correlation coefficient

$$r^2 = E\left[\frac{D^2}{p(1-p)\,q(1-q)}\right]$$

(where $D = f_{11} - pq$) for a two allele model in segregating
populations.  His  results,  using  a  Taylor's  series  approximation  for 
both  the  mean  and  variance,  indicated  that  the  C.V.  could  be  greater 
than  one  hundred  percent.  However,  the  remainder  term  in  the  Taylor's 
series  for  the  mean  can  be  large.  This  can  result  in  a  negative 
approximation  to  the  squared  correlation  coefficient  if  there  are  rare 
alleles  in  the  population.  This  is  shown  with  an  example  given  in 
Appendix  7.  This  example  assumes  that  12  replicate  populations  are 
observed,  ten  with  the  most  frequent  gamete  having  a  frequency  of 
0.9998,  and  the  other  two  replicate  populations  with  the  most  frequent 
gamete having a frequency of 0.97 (ten populations are given rare
alleles in order to make the effects of the rare alleles more
noticeable). When the expected values for these populations are
substituted into the second order Taylor's expansion of $r^2$,

$$r^2 \approx \frac{E[D^2]}{E[p(1-p)q(1-q)]}\left\{1 - \frac{E[D^2\,p(1-p)q(1-q)]}{E[D^2]\,E[p(1-p)q(1-q)]} + \frac{E[p^2(1-p)^2q^2(1-q)^2]}{E^2[p(1-p)q(1-q)]}\right\}$$

a negative squared correlation is found. Since the Taylor's
approximation  to  the  squared  correlation  coefficient  can  be  negative 
when  the  mutation  rate  is  small,  the  C.V.  for  the  correlation  was  not 
calculated. 

The coefficient of variation for squared linkage disequilibrium is
shown in Figure 6.2 with $\nu_1 = \nu_2 = \nu$, for $4N\nu$ = 0.125,
0.25, 0.5, 1.0, 2.0, 4.0. It can be seen that the C.V. will be less
than one hundred percent only when $4N\nu$ is very large. When $4N\nu$
is small, the standard deviation is several times the size of the
mean. The results in Figure 6.2 show that the C.V. is relatively
constant for each $4N\nu$ when $4Nr$ is large or small. For small
$4N\nu$, the minimum C.V. is reached as $r \to 0$ and the maximum when
$4Nr \approx 10.0$. Note also that as $r \to \infty$ the expected
linkage disequilibrium approaches zero but the C.V. remains relatively
constant. This demonstrates that the distribution must be highly
skewed. The C.V. appears to increase exponentially as $4N\nu$
decreases. Since the C.V. is so large, the utility of tests based on
$D^2$ must be questioned.

Figure 6.2: The percent coefficient of variation for the squared
linkage disequilibrium.

Variance of homozygosity


The coefficient of variation of homozygosity when a gene consists
of two sites with recombination between them can also be determined.
This model is a good approximation of a gene with two exons and a
single intron. Introns occur in most eukaryotic genes and must
significantly increase recombination within genes since they can exist
in large numbers and can be a major portion of the gene. For example,
the vitellogenin genes of Xenopus laevis have 33 introns within each
gene (Wahli et al., 1980); the $\alpha 2$ type I collagen gene of
chickens has more than 49 introns (Vogeli et al., 1981); the single
intron of the chloroplast tRNA$^{Ile}$ gene in Zea mays is more than
92% of the total length of the gene (Koch et al., 1981).

In Figure 6.3 and Figure 6.4 let $\mu = 2\nu$ be the mutation rate of
the complete gene (each site within the gene is assumed to have the
same mutation rate, $\nu$). Figure 6.3 shows that the C.V. of
homozygosity increases when recombination occurs and when the mutation
rate of the complete gene is large, as predicted by Strobeck and
Morgan (1978). When the r to $\mu$ ratio is small, the maximum C.V.
occurs when $\theta = 8N\nu \approx 1.5$ (Figure 6.3). When the
mutation rate is small, the effects of recombination become smaller
since recombination acts only on variability already present. The
variance of homozygosity when r is large is given by

$$\frac{4\theta\,(6 + 6\theta + \theta^2)}{(1+\theta)^4\,(2+\theta)^2\,(3+\theta)^2}$$
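
One way to read the large-r expression is as the variance of a product of two independent single-site homozygosities. The Python sketch below checks that reading with exact fractions. It rests on assumptions that are not stated in the text: that the two sites behave independently when r is large, that each site follows the single-locus mean $1/(1+\theta)$ and variance $2\theta/((1+\theta)^2(2+\theta)(3+\theta))$ (Watterson, 1974; Stewart, 1976), and that $\theta$ in the expression is the per-site parameter. The function names are hypothetical.

```python
from fractions import Fraction


def site_mean(theta):
    """Expected homozygosity at one site under the infinite-alleles model."""
    return 1 / (1 + theta)


def site_variance(theta):
    """Single-locus variance of homozygosity (Watterson, 1974; Stewart, 1976)."""
    return 2 * theta / ((1 + theta) ** 2 * (2 + theta) * (3 + theta))


def product_variance(theta):
    """Var(F1 * F2) for two independent, identically distributed site homozygosities."""
    second_moment = site_variance(theta) + site_mean(theta) ** 2   # E[F^2] per site
    return second_moment ** 2 - site_mean(theta) ** 4


def large_r_expression(theta):
    """The large-r variance expression quoted in the text."""
    return (4 * theta * (6 + 6 * theta + theta ** 2)
            / ((1 + theta) ** 4 * (2 + theta) ** 2 * (3 + theta) ** 2))


for theta in (Fraction(1, 8), Fraction(1, 2), Fraction(1), Fraction(4)):
    print(theta, product_variance(theta) == large_r_expression(theta))
```

Under these assumptions the two quantities agree exactly for every value tried, which suggests that the quoted expression is the independent-sites limit.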


One possible way to adjust for the effects of intragenic
recombination would be to increase the value of $\theta$ to match the
increased variability due to recombination. In a sense, this amounts
to finding an "effective mutation rate". This can be done by setting
$\theta = (1/\Phi_{200}) - 1$. This value of $\theta$ gives a single
locus the same expected homozygosity. The expected variance of
homozygosity for a single locus model with this $\theta$ is

$$\frac{2\theta}{(1+\theta)^2\,(2+\theta)\,(3+\theta)}$$

(Watterson, 1974; Stewart, 1976). However, Strobeck and Morgan (1978)
argued that the variance of a single locus model even with an
increased $\theta$ would underestimate the true variance. In Figure
6.4 the ratio of the true variance of homozygosity to the adjusted
variance of a single locus model is shown for $0.0 < \theta < 4.0$.
When the amount of recombination is large, the ratio quickly increases
above one as the mutation rate increases. The ratio is small only when
$\theta < 0.5$ and r is small. Therefore, intragenic recombination
cannot be modelled by increasing the mutation rate because the
variance and presumably all other moments about the mean are changed.

Figure 6.4:
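
The adjustment described above can be written out as a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the input values for the expected homozygosity and the "true" variance are made-up numbers chosen for easy arithmetic, not results from the figures.

```python
from fractions import Fraction


def effective_theta(phi_200):
    """Single-locus theta that reproduces the expected homozygosity Phi_200."""
    return 1 / phi_200 - 1


def single_locus_variance(theta):
    """Variance of homozygosity for one locus (Watterson, 1974; Stewart, 1976)."""
    return 2 * theta / ((1 + theta) ** 2 * (2 + theta) * (3 + theta))


def variance_ratio(true_variance, phi_200):
    """Ratio of the true variance to the adjusted single-locus variance."""
    return true_variance / single_locus_variance(effective_theta(phi_200))


phi_200 = Fraction(1, 4)            # assumed expected homozygosity of the two-site gene
true_variance = Fraction(13, 576)   # assumed true variance of homozygosity
theta_eff = effective_theta(phi_200)
adjusted = single_locus_variance(theta_eff)
ratio = variance_ratio(true_variance, phi_200)
print(theta_eff, adjusted, ratio)
```

A ratio above one, as in this made-up case, is the situation the text describes: the single-locus model with an inflated $\theta$ matches the mean homozygosity but understates its variance.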


Summary 


Identity coefficients are used to construct a sufficient set of
equations to determine the fourth order moments of gene frequencies
for two linked loci. This allows the variance of the expected squared
linkage disequilibrium to be found. It is shown that the coefficient
of variation is generally greater than one and, if the mutation rate
is small, the standard deviation is more than four times the size of
the mean. This demonstrates that squared linkage disequilibrium is a
highly variable quantity. The variance of homozygosity for a gene
which consists of two sites can also be obtained. Recombination
between these sites increases the variance of homozygosity, suggesting
that intragenic recombination significantly changes all the expected
moments of gene frequencies if $4N\mu > 1.0$ and $r > \mu$.


Chapter  7 


Conclusions 


Throughout the preceding chapters the method of identity
coefficients  has  been  used  to  solve  several  problems.  This  method  is 
not  the  only  way  in  which  the  problems  can  be  solved.  Evaluating  the 
moments  of  the  continuous  diffusion  approximation  would  give  similar 
answers.  Wright's  path  coefficients  will  give  exactly  the  same 
answers.  There  are  probably  many  different  ways  in  which  these 
problems  can  be  approached.  The  advantage  of  the  method  of  identity 
coefficients  (and  the  other  probability  methods  mentioned  in  Chapter 
1)  is  that  it  is  simple  and  intuitive.  Indeed,  the  recursion 
relationships  almost  write  themselves.  When  some  quantity  has  been 
determined  to  be  of  interest  and  it  has  been  defined  in  terms  of  a 
probability,  then  a  recursion  relationship  can  be  found  for  this 
quantity  using  simple  probability  arguments.  In  writing  such  a 
recursion  relationship  it  is  often  found  that  other  probabilities  are 
required.  Recursion  relationships  for  these  probabilities  can  be 
written  and  may  suggest  that  still  more  are  required.  Eventually  a 
complete  set  is  determined  and  can  then  be  solved.  As  stated  by 
Cockerham  (1967),  "While  Malecot's  definitions  and  methods  must  lead 
to  the  same  results  as  does  Wright's,  they  are  generally  easier  to 
grasp  and  apply,  requiring  only  simple  probability  arguments,  for 
those  not  well  versed  in  path  coefficients".  The  simplicity  of  this 
method makes it very useful and the preceding chapters give only a
slight indication of what can be done using identity coefficients.

This method has been used to examine several properties of
linkage disequilibrium in Chapters 2, 3 and 6. In general these
studies point out a few problems that must be considered. First,
higher order linkage disequilibria among several loci need not be a
strong indication of the effects of selection. Although smaller than
two-locus disequilibria, three-locus disequilibria are of the same
order of magnitude. Secondly, the sum of squares of linkage
disequilibrium is not a "well-behaved" function in a partially
selfing, finite population (Chapter 3). The linkage disequilibria may
increase or decrease with different rates of selfing depending on the
mutation rates. This problem can be circumvented by considering the
standard squared linkage disequilibrium, a quantity related to the
correlation of gene frequencies. However, Chapter 6 shows that there
are problems with this quantity as well. The squared standard linkage
disequilibrium is a first order Taylor's series approximation to the
correlation coefficient. It is shown in Chapter 6 that a second order
Taylor's series approximation can be negative when there are rare
alleles in the population. This calls into question the accuracy of
the Taylor's series approximation of the squared standard linkage
disequilibrium to the correlation coefficient. Chapter 6 also
demonstrates that a major component of this quantity, the sum of
squares of the linkage disequilibrium, has a very large variance
(particularly with realistic mutation rates). These chapters have
considered only the value of the parameters in the whole population.
It is necessary to determine their expected values in a sample but
such a theory would be difficult to develop.


The  effects  of  intragenic  recombination  have  been  examined  in 
Chapters  2,  4  and  6.  These  studies  demonstrate  several  properties  of 
this  process.  Even  though  intragenic  recombination  may  be  rare  it  can 
have  significant  effects  in  some  situations.  In  hybrid  individuals  and 
individuals  with  an  interracial  background  there  is  a  significant 
chance  that  they  may  have  unusual  or  unique  combinations  of  sites 
within  their  genes.  It  would  be  preferable  to  determine  the  actual 
number  of  alleles  created  in  hybrid  populations  by  intragenic 
recombination  but  the  effective  number  of  alleles  is  suggestive.  The 
importance  of  these  new  alleles  depends  partly  upon  their  fitnesses. 
Unfortunately,  very  little  is  known  about  the  relative  fitnesses  of 
different  combinations  of  sites  within  genes. 

The results in Chapter 6 suggest that intragenic recombination
may significantly alter the distribution of gene frequencies. This
depends on the sizes of the mutation rates and the amount of
recombination between sites. If these two processes are sufficiently
large, most of the models in common use will be compromised. For
example, Ewens' (1972) method to estimate the parameter $4N\mu$ will
give an upwardly biased answer. Since the actual sizes of the
recombination and mutation rates are not known precisely, it is not
clear how large the effects of intragenic recombination will be. The
results in Chapter 2, however, demonstrate that the effects of
recombination between more than two sites within a gene may not have
to be considered. This study shows that, at least, the overall
homozygosity is accurately modelled by just a two-site model over a
wide range of parameters.


The variance of homozygosity in a structured population is
examined in Chapter 5. It is shown there that the variance strongly
depends on the amount of migration between subpopulations. The
expected variance can, however, be accurately estimated using the
amount of variability present within the subpopulations. This suggests
that the structure of a population, unlike intragenic recombination,
alters the distribution of gene frequencies in a simple fashion. This
result, along with those of Ewens and Gillespie (1974) and Slatkin
(1982), demonstrates that the standard theories for a randomly mating
population may be applicable to a structured population.

Appendix 2.

Recursion  Relationships  for  the  Expected  Values  of  the 
Identity  Coefficients  Over  Replicate  Populations  for 
a  Partially  Selfing,  Finite  Population. 


V/A)’  ■  C1-^ 


S(W(a/aD)  +  C1'S)$(A)(A) 


’(B/B)’  " 


S  Qi+HV (b/b] ^  *  (1'S)*(B)(B) 


*CA) (A) '  “  (1"U) 


Sf^^CA/A)3  +  C1‘  S'3* (A)  (A), 


A  '  -  fl-vV 

CB)(B)  U  VJ 


’(AB/AB)’  ’  Cl-U)2a-v52 


S[(l-r)2A1  +2r(l-r)A2  +  r  +  (1-S)* 


(AB)  (AB) 


‘(AB)(A»)'  •  (1-“)2(1-v)2 


»-r)2[^j  *  (1-  jj)«(AB)  (AB)]  *  Ml-r)[iA2  ♦  (1-  f)»(AB)(A/B)l 


2  X  1 

+  r  [|^1  +  (1-  n34(A/B)  (A/B)^ 


*  (AB)  (A/B) '  “  ‘•1_U3 


S{d-r)C^3  ♦  Cl-  ♦  r[^3  ♦  (1- 


+  (l-S){(l-r)[^3  +  ♦  (1-  f)r(AB)(A)(B^ 


+  +  N^4  +  (1_  N3r(A/B)  (A)  (B) 


]} 


* (AB/B) (A) ’  =  Cl-U)2(l-V)2 


S^NA3  +  N3n4^  +  Cl-S){ Cl-r)[^21  +  j^3  ♦  (1-  N) T (AB) (A) (B) ^ 


♦#>2  -  Cl- for 


(A/B) (A)(B)^J 


(AB/A) (B) 


(l-U)2(l-v)2 


S^3  +  (1-  +  Ci-sHCi-r)^  +  Jn4  *  Ci-  J)r(AB)(A)(B)] 


*  r^2  +  ^4  +  (1‘  N3r(A/B)  (A)  (B)^_ 
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V/A)  (B/B)  ' 

*(A/B)(A/B)' 

F  (AB)  (A)  (B)  ' 

r(B/B)(A)(A) 

% 

r(A/A)(B)(B) 

r (A/B) (A) (B) 

A(A)(B)(A)(B) 

where  \\  ■ 

A2  - 


(i-u)2(i-v)2(s2[iA3  ♦  (i-  Jdq5]  *  s(i-s)[^(AB/A)(B)  ♦  (i-  |)r(A/A)(BHB)] 


+  S Cl-S)  [j^I>  (;AB/B)  (A)  +  C1‘  N)r(B/B)  (A)  (A)-' 


(l.S)2[-i _ a  ♦  4(N-2)n  +  lN-^jjN-3^  ] 

*•'  J  LN(N-1)  6  N  (N-l)  3  N(N-l)  a  (A)  (B)  (A)  (B) J 


a-y)2(l-v)2[s2[ij\j  .  (1-  i)06]  •  2S(1-S)[i«3  •  io4  .  (1-  |)r(A/B)(A)(B)] 


*  C1"SJ  ^N(N-1)(115  ’  V  *  N(N-1)(I11  *  n2  *  2IV  *  ^  N(N-l)  ^(A)  (B)  {A)  (B)^ 


(l-y)2(l-v)2 


<l-r)t4‘3  *  Hwj  *  s  *  V  ♦  wujpjf  ] 


N  N 


_r  1.  N-l,„  „  „  ,  (N-l) CN-2)„ 

rC^3  +  N2(n2  +  °3  +  V  +  ^  ^2  ^(A/B)(A)(B)JJ 


2  2  | 


'  =  (i-u)  (l-v)  f s [— 4a 3  +  — (2«4  +  n5)  +  (N-~--zN'— n2] 


N  N 


+  ( l-s)  [ — +  n2  +  2n3)  +  ^~z(n1  +  4n3)  +  2^N  A(A)  (B)  (A)  (B)  ] 


1  -  ( i-y) 2( l-v) 2 


s  [ — 2-a3  +  ^(2«3  +  n5)  +  (N-1}2N-2)ni] 

N  N  N 


+  (i-s)[-^(n1  +  n2  +  2n4)  +  ^-|(n2  +  An3)  +  2')2N  ^a(a)  (b)  (a)  (b)  ^ 


2  2  f  l  M_  1 

’  -  (1-y)  (l-v)  -r  X‘  ’ 


s[4*3  +  ^(03  +  «4  +  n5)  +  (N"1)2N~2^n3] 

N  N  N 


+  (1— S)  [ — 2"(^3  “*•  f!i|  +  0 )  +  — 2"(ni  +  II2  +  3 II 3 )  +  — - ^4 - ^"A 


(A)(B)(A)(B) 


'  =■  ( l-y) 2 ( l-vf  -4a3  +  ^4-(2n3  +  2n4  +  n5  +  2n6)  +  ( —  1--)4N--^-(n1  +  n2  +  4n3) 


N  N 


+  (N-l)(N-2)(N-3). 


(A) (B) (A) (B) 


*5  +  W 


(AB/AB) 


^(A/A)  +  lif(B/B) 


(A/A)  +  *(B/B)  +  * 


A3  -  \  +  hx 


(AB/AB) 
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°1  =  ^  (AB)  (AB)  +  ^ (AB) (A/B) 
n2  =  ^CABKA/B)  +  ^CA/B)  (A/B) 
n3  =  ^(BJCB)  +  ^ (AB/A) (B) 

R4  3  *** CAJ  (AD  +  ^(AB/BHA) 

n5  =  h  *  W(A/A)  +  **f(B/B)  +  ^(A/A)  (B/B) 

°6  =  W(AB)(AB)  +  (AB)  (A/B)  +  ***  (A/B)  (A/B) 


"l  *  Ji*(B)(B)  +  ljr(A/AKB)CB) 
n2  “  is$(A)(A)  +  Jir(B/B)(A)  (A) 


n3  3  JsI’  (AB)  (A)  (B)  +  ^(A/B)  (A)  (B) 

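If $N \gg 1$, $\mu = O(1/N)$, $\nu = O(1/N)$, and $r = O(1/N)$, then
these equations can be approximated by

$$
\begin{aligned}
\Phi_{(A/A)}' &= S\big(\tfrac12 + \tfrac12\Phi_{(A/A)}\big) + (1-S)\,\Phi_{(A)(A)} \\
\Phi_{(B/B)}' &= S\big(\tfrac12 + \tfrac12\Phi_{(B/B)}\big) + (1-S)\,\Phi_{(B)(B)} \\
\Phi_{(AB/AB)}' &= S\big(\tfrac12 + \tfrac12\Phi_{(AB/AB)}\big) + (1-S)\,\Phi_{(AB)(AB)} \\
\Phi_{(AB)(A/B)}' &= S\big(\tfrac12\Phi_{(AB)(AB)} + \tfrac12\Phi_{(AB)(A/B)}\big) + (1-S)\,\Gamma_{(AB)(A)(B)} \\
\Phi_{(AB/B)(A)}' &= S\big(\tfrac12\Phi_{(A)(A)} + \tfrac12\Phi_{(AB/B)(A)}\big) + (1-S)\,\Gamma_{(AB)(A)(B)} \\
\Phi_{(AB/A)(B)}' &= S\big(\tfrac12\Phi_{(B)(B)} + \tfrac12\Phi_{(AB/A)(B)}\big) + (1-S)\,\Gamma_{(AB)(A)(B)} \\
\Phi_{(A/A)(B/B)}' &= S^2\big(\tfrac14 + \tfrac14\Phi_{(A/A)} + \tfrac14\Phi_{(B/B)} + \tfrac14\Phi_{(A/A)(B/B)}\big) + S(1-S)\big(\Gamma_{(A/A)(B)(B)} + \Gamma_{(B/B)(A)(A)}\big) \\
&\quad + (1-S)^2\,\Delta_{(A)(B)(A)(B)} \\
\Phi_{(A/B)(A/B)}' &= S^2\big(\tfrac14\Phi_{(AB)(AB)} + \tfrac12\Phi_{(AB)(A/B)} + \tfrac14\Phi_{(A/B)(A/B)}\big) + 2S(1-S)\,\Gamma_{(A/B)(A)(B)} \\
&\quad + (1-S)^2\,\Delta_{(A)(B)(A)(B)} \\
\Gamma_{(B/B)(A)(A)}' &= S\big(\tfrac12\Phi_{(A)(A)} + \tfrac12\Gamma_{(B/B)(A)(A)}\big) + (1-S)\,\Delta_{(A)(B)(A)(B)} \\
\Gamma_{(A/A)(B)(B)}' &= S\big(\tfrac12\Phi_{(B)(B)} + \tfrac12\Gamma_{(A/A)(B)(B)}\big) + (1-S)\,\Delta_{(A)(B)(A)(B)} \\
\Gamma_{(A/B)(A)(B)}' &= S\big(\tfrac12\Gamma_{(AB)(A)(B)} + \tfrac12\Gamma_{(A/B)(A)(B)}\big) + (1-S)\,\Delta_{(A)(B)(A)(B)}
\end{aligned}
\tag{A1}
$$

neglecting terms of $O(1/N)$ or less and

$$
\begin{aligned}
\Phi_{(A)(A)}' &= \tfrac{1}{N}\big(\tfrac12 + \tfrac12\Phi_{(A/A)}\big) + \big(1 - \tfrac{1}{N} - 2\mu\big)\Phi_{(A)(A)} \\
\Phi_{(B)(B)}' &= \tfrac{1}{N}\big(\tfrac12 + \tfrac12\Phi_{(B/B)}\big) + \big(1 - \tfrac{1}{N} - 2\nu\big)\Phi_{(B)(B)} \\
\Phi_{(AB)(AB)}' &= \tfrac{1}{N}\big(\tfrac12 + \tfrac12\Phi_{(AB/AB)}\big) + \big(1 - \tfrac{1}{N} - 2\mu - 2\nu - 2r\big)\Phi_{(AB)(AB)} + 2r\,\Phi_{(AB)(A/B)} \\
\Gamma_{(AB)(A)(B)}' &= \tfrac{1}{N}\big(\tfrac12\Phi_{(A)(A)} + \tfrac12\Phi_{(AB/B)(A)} + \tfrac12\Phi_{(B)(B)} + \tfrac12\Phi_{(AB/A)(B)} + \tfrac12\Phi_{(AB)(AB)} + \tfrac12\Phi_{(AB)(A/B)}\big) \\
&\quad + \big(1 - \tfrac{3}{N} - 2\mu - 2\nu - r\big)\Gamma_{(AB)(A)(B)} + r\,\Gamma_{(A/B)(A)(B)} \\
\Delta_{(A)(B)(A)(B)}' &= \tfrac{1}{N}\big(\tfrac12\Phi_{(A)(A)} + \tfrac12\Gamma_{(B/B)(A)(A)} + \tfrac12\Phi_{(B)(B)} + \tfrac12\Gamma_{(A/A)(B)(B)} + 2\Gamma_{(AB)(A)(B)} + 2\Gamma_{(A/B)(A)(B)}\big) \\
&\quad + \big(1 - \tfrac{6}{N} - 2\mu - 2\nu\big)\Delta_{(A)(B)(A)(B)}
\end{aligned}
\tag{A2}
$$

neglecting terms of $O(1/N^2)$ or less.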


Appendix  3. 

Recursion  Relationships  for  the  Expected  Values  of  the  Identity 
Coefficients  Over  Replicate  Populations  for  Two  Loci  with  Two  Subpopulations. 
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(abUaUb).  *  2N1(''(a)1(a)/f(b)1(b),+*(ab)1(ab),)  +  (1’  2N,  'r'3rni'2ui'2v'2,r(ab)1  (a),  (b),  +  r4(a),  (b),  (a),  (b),  *  m'r(ab)?(a),  (b) 
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(ab),(a),(b) 


(ab)2(a)2(b),  '  2N7*'(b)1(b)2  +  (1‘  2*7  -r'mi '2n,2'2''* -2v2)r(ab)2(a )2(b),  +  ri(a)2(b)) (a )2(b)2  +  m ‘ r ( a b ) 2 ( a )2 ( b ) 2  +  "^(ab),  (a )2(b) 
Appendix 4.

Recursion  Relationships  for  the  Expected  Values  of  the  Identity  Coefficients 
Over  Replicate  Populations  for  the  Variance  of  Homozygosity  in  a  Single  Population. 
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3m(1-h)  Tmr  2(2N-1)(1-  )  +  (2N-1 )  (2N-2)[2(1-  T-hdr..,,  +  (1-  rMr 
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2m(1-m)  op  (2N-1)(1-  rh-)(l-*„)  +  (2N-1 ) ( 2N-2 ) [5(1-  +  (1-  pfy)r 


(2N-1 ) (2N-2) (2N-3)[(1  - 




Appendix  5 


Derivation  of  the  Recursion  Relationship 
for  Two-Locus,  Fourth  Order  Moments 


To determine the value of $\Phi'_{ijk/\ell mn/p/q}$ (the value of the
coefficient in the next generation) requires that
$i+j+k+\ell+m+n+p+q$ gametes be drawn at random, without replacement,
from the present generation. For the moment, assume that none of these
gametes are the result of mutation or recombination in the previous
generation. Assume also that no two of these gametes are copies of a
single gamete in the previous generation. This will be true with
approximate probability

$$1 - \nu_1(i+j+\ell+m+p+q) - \nu_2(i+k+\ell+n+p+q) - r(i+\ell+p+q) - \tfrac{1}{4N}(i+j+k+\ell+m+n+p+q)(i+j+k+\ell+m+n+p+q-1)$$

When none of these events occur, the sample of gametes will satisfy
the required structure among alleles with probability

$$\Phi_{ijk/\ell mn/p/q}$$

This is the first term of the recursion relationship given below; no
events have occurred which change the probability from one generation
to the next.
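
The approximate "no event" probability can be evaluated directly from the group sizes. The Python sketch below is illustrative: the function name and the parameter values are assumptions, and the pairwise-copy term is read as the number of gamete pairs, $\tfrac12 T(T-1)$ for a sample of T gametes, times the probability $1/(2N)$ that a given pair copies the same parental gamete.

```python
def no_event_probability(groups, N, nu1, nu2, r):
    """First-order probability of no mutation, recombination or shared parental gamete.

    groups = (i, j, k, l, m, n, p, q), the eight group sizes of the coefficient.
    """
    i, j, k, l, m, n, p, q = groups
    total = sum(groups)
    a_genes = i + j + l + m + p + q      # gametes scored at locus A
    b_genes = i + k + l + n + p + q      # gametes scored at locus B
    two_locus = i + l + p + q            # gametes scored at both loci
    pairs = total * (total - 1) / 2      # unordered pairs of sampled gametes
    return (1
            - nu1 * a_genes
            - nu2 * b_genes
            - r * two_locus
            - pairs / (2 * N))


# Hypothetical values: two gametes scored at both loci (coefficient Phi_200).
prob = no_event_probability((2, 0, 0, 0, 0, 0, 0, 0), N=1000, nu1=1e-5, nu2=1e-5, r=1e-4)
print(prob)
```

Because all second-order terms are dropped, the value is only meaningful when each subtracted term is small, as assumed throughout the chapter.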

When two of the gametes are copies of one gamete in the previous
generation it is necessary to determine their probability of identity
/ non-identity. It is convenient to consider the sample of gametes in
groups (as in the definition of the coefficients). When two gametes
from the same group are copies of one gamete, their probability of
identity is one and therefore, the probability that the complete
sample of gametes are identical / non-identical is equivalent to the
probability for a sample with one less gamete in that group. For
example, if two of the gametes in group I are copies of one gamete
(this can happen in $\binom{i}{2}$ ways), then the probability is

$$\Phi_{i-1\,j\,k/\ell\,m\,n/p/q}$$

The probabilities are more complicated when two gametes from different
groups are copies of one gamete in the previous generation. The
derivation of these probabilities will be indicated with three
examples. Example 1: From the definition, gametes in groups I and IV
must have different alleles. Therefore if gametes from group I and
group IV are copies of a single gamete in the previous generation,
then the probability that these gametes are not identical is zero.
Example 2: If gametes from group I and II are copies of one gamete
(this can happen in ij ways), then the probability is equivalent to
the probability with one less gamete in group II because the identity
of group I and II for this gamete is assured, i.e.

$$\Phi_{i\,j-1\,k/\ell\,m\,n/p/q}$$

Example 3: If gametes from group II and III are copies of one gamete
(this can happen in jk ways), then the probability is equivalent to
that with an extra gamete in group I, consisting of an A locus from
group II linked to a B locus from group III, i.e.

$$\Phi_{i+1\,j-1\,k-1/\ell\,m\,n/p/q}$$

The remaining probabilities are derived in a similar manner.


When recombination has occurred, the resulting gamete is the union
of two loci from two different gametes. Thus, if a gamete in group I
is the result of recombination (this can happen in i ways with
probability r), then the probability of identity / non-identity in the
previous generation is

$$\Phi_{i-1\,j+1\,k+1/\ell\,m\,n/p/q}$$

That is, the probability with one less gamete in group I and one more
in groups II and III. The probability of recombination can be ignored
for gametes in groups II, III, V and VI since only a single locus is
considered for these groups. For groups IV, VII and VIII the
probability is similar to that for group I.

When a locus has a mutational event the probability of identity /
non-identity is usually zero since an infinite alleles model is
assumed. There are, however, a few special cases where such mutations
can contribute to the probability. These occur when there is only one
allele of a locus (A or B) which must be different from one or more
alleles of that locus. This allele will be different with probability
one if a mutation occurs. It is then required that the remaining
gametes have the correct probability structure. If more than one
allele must be identical at the locus then a mutation cannot be
allowed. Therefore, to ensure only one allele is present, a Kronecker
delta function is used

$\delta(x) = 0$ if $x \neq 0$
$\delta(x) = 1$ if $x = 0$

The mutational events which contribute to the probability are shown
towards the end of the equation.
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Putting  all  of  this  together,  the  general  recursion  relationship 
for  the  expected  value  of  the  identity  coefficient  over  replicate 
populations  is 


^ijk/imn/p/q  "  (  1  "  2^  ^(i+j+k+i+n*n+P+<l)  (i+j+k+t+nH-n+p+q-l)  -  Vj  (i+j+i+nri-p+q) 


-  v  2  (  i+k+i+n-t-p+q )  -  r(i+i,+p+q) 


^ijk/imn/p/q  +  2N  ^^i-ljk/imn/p/q  + 


2N  ^ij-lk/imn/p/q  +  2N  iklI>ijk-l/2nin/p/q  +  2N  ^  ^-1^<,>ij-lk/£im/p/q  + 

2N  -^l+lj-lk-l/Jimn/p/q  +  2N  '^ij-lk/imn-l/p+l/q  +  2N  ^^ij-lk/inm/p/q  + 
2N  ^k^k  ^  *ijk-l/tmn/p/q  +  2N  km<^ijk-l/£m-ln/p/q+l  +  2N  '^ijk-l/imn/p/q  + 


2N  ^ ^ijk/i-lmn/p/q  +  2N  ^m*ijk/im-ln/p/q  +  2N  ^^ijk/i.mn-l/p/q  + 


2N  ^ijk/i.ns-ln/p/q  +  2N  ^^ijk/Jl+lm-ln-l/p/q  +  2N  ^^ijk/im-ln/p/q  + 


2N  ^^ijk/Zmn-l/p/q  +  2N  ^^ijk/imn-l/p/q  +  2N  isp<'p  ^  *ljk/Hinn/p-l/q  + 


2N  ^^ijk/imn/p/q-l  +  ^^i-lj+lk+l/imn/p/q  +  S'r<J’ijk/i-lmf lrt+l/p/q  + 


pr*ij+lk/Jlmn-(-l/p-l/q  +  qr*ijk+l/£.m+ln/p/q-l  +  ^  ^  (J'hO  ^OOk+l/Jlmn/O/q  + 


(  J-l)  6  (±+p)  4>00k/ imn/0/q  +  vl6(p-1)6(i+J)400k/i.nm+l/0/q  +  Vl6(jl~1)'5(lIffq)<I>ijk/00iri-l/p/0 


v16(m-l)6(Z+q)«i;]k/00n/p/0  +  v1«(q-l)«(i+™)*ljfcfl/00n/p/0  +  V26(i-l)5(k+q)*0J+10/Wp/0 


V2<5(k”l)6(i+q)'1'0j0/j,mn/p/0  +  v2'5  (q-1)  5  (i+k)  ^ojO/Unrt-ln/p/O  +  V26  (i-1)  5  (rrt-p)  jk/Onrf  10/0/q  + 


v26 (n-1) 6  (i+p)  ®1jk/0m0/0/q  +  V26(p_1)6(i+n)4'i;]+lk/0m0/0/q 
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The  recursion  relationship  can  be  solved  at  equilibrium  to  give 


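$$
\begin{aligned}
\Phi_{i\,j\,k/\ell\,m\,n/p/q}\bigl[&(i+j+\ell+m+p+q)\theta_1 + (i+k+\ell+n+p+q)\theta_2 + (i+\ell+p+q)R \\
&+ (i+j+k+\ell+m+n+p+q)(i+j+k+\ell+m+n+p+q-1)\bigr] \\
={}& i(i-1)\,\Phi_{i-1\,j\,k/\ell\,m\,n/p/q} + 2ij\,\Phi_{i\,j-1\,k/\ell\,m\,n/p/q} + 2ik\,\Phi_{i\,j\,k-1/\ell\,m\,n/p/q} \\
&+ j(j-1)\,\Phi_{i\,j-1\,k/\ell\,m\,n/p/q} + 2jk\,\Phi_{i+1\,j-1\,k-1/\ell\,m\,n/p/q} + 2jn\,\Phi_{i\,j-1\,k/\ell\,m\,n-1/p+1/q} \\
&+ 2jp\,\Phi_{i\,j-1\,k/\ell\,m\,n/p/q} + k(k-1)\,\Phi_{i\,j\,k-1/\ell\,m\,n/p/q} + 2km\,\Phi_{i\,j\,k-1/\ell\,m-1\,n/p/q+1} \\
&+ 2kq\,\Phi_{i\,j\,k-1/\ell\,m\,n/p/q} + \ell(\ell-1)\,\Phi_{i\,j\,k/\ell-1\,m\,n/p/q} + 2\ell m\,\Phi_{i\,j\,k/\ell\,m-1\,n/p/q} \\
&+ 2\ell n\,\Phi_{i\,j\,k/\ell\,m\,n-1/p/q} + m(m-1)\,\Phi_{i\,j\,k/\ell\,m-1\,n/p/q} + 2mn\,\Phi_{i\,j\,k/\ell+1\,m-1\,n-1/p/q} \\
&+ 2mq\,\Phi_{i\,j\,k/\ell\,m-1\,n/p/q} + n(n-1)\,\Phi_{i\,j\,k/\ell\,m\,n-1/p/q} + 2np\,\Phi_{i\,j\,k/\ell\,m\,n-1/p/q} \\
&+ p(p-1)\,\Phi_{i\,j\,k/\ell\,m\,n/p-1/q} + q(q-1)\,\Phi_{i\,j\,k/\ell\,m\,n/p/q-1} \\
&+ iR\,\Phi_{i-1\,j+1\,k+1/\ell\,m\,n/p/q} + \ell R\,\Phi_{i\,j\,k/\ell-1\,m+1\,n+1/p/q} \\
&+ pR\,\Phi_{i\,j+1\,k/\ell\,m\,n+1/p-1/q} + qR\,\Phi_{i\,j\,k+1/\ell\,m+1\,n/p/q-1} \\
&+ \theta_1\delta(i-1)\delta(j+p)\,\Phi_{0\,0\,k+1/\ell\,m\,n/0/q} + \theta_1\delta(j-1)\delta(i+p)\,\Phi_{0\,0\,k/\ell\,m\,n/0/q} \\
&+ \theta_1\delta(p-1)\delta(i+j)\,\Phi_{0\,0\,k/\ell\,m\,n+1/0/q} + \theta_1\delta(\ell-1)\delta(m+q)\,\Phi_{i\,j\,k/0\,0\,n+1/p/0} \\
&+ \theta_1\delta(m-1)\delta(\ell+q)\,\Phi_{i\,j\,k/0\,0\,n/p/0} + \theta_1\delta(q-1)\delta(\ell+m)\,\Phi_{i\,j\,k+1/0\,0\,n/p/0} \\
&+ \theta_2\delta(i-1)\delta(k+q)\,\Phi_{0\,j+1\,0/\ell\,m\,n/p/0} + \theta_2\delta(k-1)\delta(i+q)\,\Phi_{0\,j\,0/\ell\,m\,n/p/0} \\
&+ \theta_2\delta(q-1)\delta(i+k)\,\Phi_{0\,j\,0/\ell\,m+1\,n/p/0} + \theta_2\delta(\ell-1)\delta(n+p)\,\Phi_{i\,j\,k/0\,m+1\,0/0/q} \\
&+ \theta_2\delta(n-1)\delta(\ell+p)\,\Phi_{i\,j\,k/0\,m\,0/0/q} + \theta_2\delta(p-1)\delta(\ell+n)\,\Phi_{i\,j+1\,k/0\,m\,0/0/q}
\end{aligned}
$$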


where $\theta_1 = 4Nv_1$, $\theta_2 = 4Nv_2$, and $R = 4Nr$.


Appendix  6 


The  Necessary  Set  of  Equations 


for  Two-Locus,  Fourth  Order  Moments 


Several properties of the coefficients must be used to ensure the
minimal number of equations. From the definition of the coefficients
it is apparent that

$$
\Phi_{ijk/\ell mn/p/q} = \Phi_{\ell mn/ijk/q/p} = \Phi_{pjn/qmk/i/\ell} = \Phi_{qmk/pjn/\ell/i}
$$
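The way these identities cut down the number of distinct coefficients can be illustrated with a short Python sketch. It assumes a coefficient is represented by the tuple (i, j, k, l, m, n, p, q), read off the subscript in the order ijk/lmn/p/q; the function names are hypothetical and are not taken from the dissertation. Each coefficient is mapped to the smallest of its four equivalent index sets, so that equivalent coefficients share one representative.

```python
# Sketch only: index tuples are (i, j, k, l, m, n, p, q), an assumed encoding
# of the subscript ijk/lmn/p/q. Function names are illustrative.

def second_form(x):
    """ijk/lmn/p/q -> lmn/ijk/q/p."""
    i, j, k, l, m, n, p, q = x
    return (l, m, n, i, j, k, q, p)


def third_form(x):
    """ijk/lmn/p/q -> pjn/qmk/i/l."""
    i, j, k, l, m, n, p, q = x
    return (p, j, n, q, m, k, i, l)


def fourth_form(x):
    """ijk/lmn/p/q -> qmk/pjn/l/i."""
    i, j, k, l, m, n, p, q = x
    return (q, m, k, p, j, n, l, i)


def equivalent_forms(x):
    """All index tuples that the identities declare equal to x."""
    return {tuple(x), second_form(x), third_form(x), fourth_form(x)}


def canonical(x):
    """Pick one representative (the lexicographically smallest tuple)."""
    return min(equivalent_forms(x))


example = (1, 2, 0, 0, 1, 3, 2, 0)
representative = canonical(example)
```

Applying the second form to the third form gives the fourth, so the four forms are closed under composition and `canonical` returns the same tuple for every member of a set.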


A  further  group  of  identities  among  the  coefficients  occurs  when  a 
single  A  (or  B)  allele  must  be  different  from  one  or  more  A  (or  B) 
alleles.  These  probabilities  can  be  obtained  as  the  sum  of  two  other 
probabilities.  A  well  known  example  for  a  single  locus  is  that  the 
expected  heterozygosity  equals  one  minus  the  expected  homozygosity. 
Similar  reasoning  leads  to 


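$$
\begin{aligned}
\Phi_{1\,0\,k/\ell\,m\,n/0/q} &= \Phi_{0\,0\,k+1/\ell\,m\,n/0/q} - \Phi_{0\,0\,k/\ell\,m\,n/0/q+1} \\
\Phi_{0\,1\,k/\ell\,m\,n/0/q} &= \Phi_{0\,0\,k/\ell\,m\,n/0/q} - \Phi_{0\,0\,k/\ell\,m+1\,n/0/q} \\
\Phi_{0\,0\,k/\ell\,m\,n/1/q} &= \Phi_{0\,0\,k/\ell\,m\,n+1/0/q} - \Phi_{0\,0\,k/\ell+1\,m\,n/0/q} \\
\Phi_{1\,j\,0/\ell\,m\,n/p/0} &= \Phi_{0\,j+1\,0/\ell\,m\,n/p/0} - \Phi_{0\,j\,0/\ell\,m\,n/p+1/0} \\
\Phi_{0\,j\,1/\ell\,m\,n/p/0} &= \Phi_{0\,j\,0/\ell\,m\,n/p/0} - \Phi_{0\,j\,0/\ell\,m\,n+1/p/0} \\
\Phi_{0\,j\,0/\ell\,m\,n/p/1} &= \Phi_{0\,j\,0/\ell\,m+1\,n/p/0} - \Phi_{0\,j\,0/\ell+1\,m\,n/p/0} \\
\Phi_{i\,j\,k/1\,0\,n/p/0} &= \Phi_{i\,j\,k/0\,0\,n+1/p/0} - \Phi_{i\,j\,k/0\,0\,n/p+1/0} \\
\Phi_{i\,j\,k/0\,1\,n/p/0} &= \Phi_{i\,j\,k/0\,0\,n/p/0} - \Phi_{i\,j+1\,k/0\,0\,n/p/0} \\
\Phi_{i\,j\,k/0\,0\,n/p/1} &= \Phi_{i\,j\,k+1/0\,0\,n/p/0} - \Phi_{i+1\,j\,k/0\,0\,n/p/0} \\
\Phi_{i\,j\,k/1\,m\,0/0/q} &= \Phi_{i\,j\,k/0\,m+1\,0/0/q} - \Phi_{i\,j\,k/0\,m\,0/0/q+1} \\
\Phi_{i\,j\,k/0\,m\,1/0/q} &= \Phi_{i\,j\,k/0\,m\,0/0/q} - \Phi_{i\,j\,k+1/0\,m\,0/0/q} \\
\Phi_{i\,j\,k/0\,m\,0/1/q} &= \Phi_{i\,j+1\,k/0\,m\,0/0/q} - \Phi_{i+1\,j\,k/0\,m\,0/0/q}
\end{aligned}
$$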

When  a  single  A  (or  B)  allele  must  be  different  from  zero  A  (or  B) 
alleles,  then 

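$$
\begin{aligned}
\Phi_{0\,1\,k/0\,0\,n/0/0} &= \Phi_{0\,0\,k/0\,1\,n/0/0} = \Phi_{0\,0\,k/0\,0\,n/0/0} \\
\Phi_{1\,0\,k/0\,0\,n/0/0} &= \Phi_{0\,0\,k/0\,0\,n/0/1} = \Phi_{0\,0\,k+1/0\,0\,n/0/0} \\
\Phi_{0\,0\,k/1\,0\,n/0/0} &= \Phi_{0\,0\,k/0\,0\,n/1/0} = \Phi_{0\,0\,k/0\,0\,n+1/0/0} \\
\Phi_{0\,j\,1/0\,m\,0/0/0} &= \Phi_{0\,j\,0/0\,m\,1/0/0} = \Phi_{0\,j\,0/0\,m\,0/0/0} \\
\Phi_{1\,j\,0/0\,m\,0/0/0} &= \Phi_{0\,j\,0/0\,m\,0/1/0} = \Phi_{0\,j+1\,0/0\,m\,0/0/0} \\
\Phi_{0\,j\,0/1\,m\,0/0/0} &= \Phi_{0\,j\,0/0\,m\,0/0/1} = \Phi_{0\,j\,0/0\,m+1\,0/0/0}
\end{aligned}
$$
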
These  identities  can  be  derived  by  expressing  the  coefficients  as  gene 
frequency  moments. 

Since it is assumed that $v_1 = v_2 = v$, at equilibrium the
coefficients are symmetrical for the A and B loci and therefore

$$
\Phi_{ijk/\ell mn/p/q} = \Phi_{ikj/\ell nm/q/p} = \Phi_{\ell nm/ikj/p/q} = \Phi_{qkm/pnj/i/\ell} = \Phi_{pnj/qkm/\ell/i}
$$

The  following  definitions  are  used 

if $i+j+k+\ell+m+n+p+q \le 1$ then $\Phi_{ijk/\ell mn/p/q} = 1$

$\Phi_{ijk/000/0/0} = \Phi_{ijk}$
$\theta = 4Nv$
$R = 4Nr$

The necessary systems of equations are derived by applying the above
rules to the general equilibrium equation given in Appendix 5.
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There  are  four  independent  equations  and  ten  systems  of  equations  with 
the  remaining  46  coefficients.  These  are 


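$$
\begin{aligned}
\hat\Phi_{020}\,(2+2\theta) &= 2 \\
\hat\Phi_{030}\,(6+3\theta) &= 6\hat\Phi_{020} \\
\hat\Phi_{040}\,(12+4\theta) &= 12\hat\Phi_{030} \\
\hat\Phi_{020/020/0/0}\,(12+4\theta) &= 4\hat\Phi_{020} - 4\hat\Phi_{030}
\end{aligned}
$$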


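$$
\begin{pmatrix} 2+4\theta+2R & -2R & 0 \\ -2 & 6+4\theta+R & -R \\ 0 & -8 & 12+4\theta \end{pmatrix}
\begin{pmatrix} \hat\Phi_{200} \\ \hat\Phi_{111} \\ \hat\Phi_{022} \end{pmatrix}
=
\begin{pmatrix} 2 \\ 4\hat\Phi_{020} \\ 4\hat\Phi_{020} \end{pmatrix}
$$

$$
\begin{pmatrix} 6+5\theta+2R & -2R & 0 \\ -4 & 12+5\theta+R & -R \\ 0 & -12 & 20+5\theta \end{pmatrix}
\begin{pmatrix} \hat\Phi_{210} \\ \hat\Phi_{121} \\ \hat\Phi_{032} \end{pmatrix}
=
\begin{pmatrix} 2\hat\Phi_{020} + 4\hat\Phi_{200} \\ 2\hat\Phi_{030} + 6\hat\Phi_{111} \\ 2\hat\Phi_{030} + 6\hat\Phi_{022} \end{pmatrix}
$$

$$
\begin{pmatrix} 12+6\theta+2R & -2R & 0 \\ -6 & 20+6\theta+R & -R \\ 0 & -16 & 30+6\theta \end{pmatrix}
\begin{pmatrix} \hat\Phi_{220} \\ \hat\Phi_{131} \\ \hat\Phi_{042} \end{pmatrix}
=
\begin{pmatrix} 2\hat\Phi_{030} + 10\hat\Phi_{210} \\ 2\hat\Phi_{040} + 12\hat\Phi_{121} \\ 2\hat\Phi_{040} + 12\hat\Phi_{032} \end{pmatrix}
$$

$$
\begin{pmatrix} 6+6\theta+3R & -3R & 0 & 0 \\ -2 & 12+6\theta+2R & -2R & 0 \\ 0 & -8 & 20+6\theta+R & -R \\ 0 & 0 & -18 & 30+6\theta \end{pmatrix}
\begin{pmatrix} \hat\Phi_{300} \\ \hat\Phi_{211} \\ \hat\Phi_{122} \\ \hat\Phi_{033} \end{pmatrix}
=
\begin{pmatrix} 6\hat\Phi_{200} \\ 2\hat\Phi_{111} + 8\hat\Phi_{210} \\ 12\hat\Phi_{121} \\ 12\hat\Phi_{032} \end{pmatrix}
$$

$$
\begin{pmatrix} 12+7\theta+3R & -3R & 0 & 0 \\ -4 & 20+7\theta+2R & -2R & 0 \\ 0 & -12 & 30+7\theta+R & -R \\ 0 & 0 & -24 & 42+7\theta \end{pmatrix}
\begin{pmatrix} \hat\Phi_{310} \\ \hat\Phi_{221} \\ \hat\Phi_{132} \\ \hat\Phi_{043} \end{pmatrix}
=
\begin{pmatrix} 6\hat\Phi_{210} + 6\hat\Phi_{300} \\ 2\hat\Phi_{121} + 4\hat\Phi_{220} + 10\hat\Phi_{211} \\ 6\hat\Phi_{131} + 12\hat\Phi_{122} \\ 6\hat\Phi_{042} + 12\hat\Phi_{033} \end{pmatrix}
$$

$$
\begin{pmatrix} 12+8\theta+4R & -4R & 0 & 0 & 0 \\ -2 & 20+8\theta+3R & -3R & 0 & 0 \\ 0 & -8 & 30+8\theta+2R & -2R & 0 \\ 0 & 0 & -18 & 42+8\theta+R & -R \\ 0 & 0 & 0 & -32 & 56+8\theta \end{pmatrix}
\begin{pmatrix} \hat\Phi_{400} \\ \hat\Phi_{311} \\ \hat\Phi_{222} \\ \hat\Phi_{133} \\ \hat\Phi_{044} \end{pmatrix}
=
\begin{pmatrix} 12\hat\Phi_{300} \\ 6\hat\Phi_{211} + 12\hat\Phi_{310} \\ 2\hat\Phi_{122} + 20\hat\Phi_{221} \\ 24\hat\Phi_{132} \\ 24\hat\Phi_{043} \end{pmatrix}
$$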
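As a numerical illustration, the first of these systems (for the coefficients with subscripts 200, 111 and 022) can be solved for given values of θ and R. The Python sketch below is not part of the dissertation: the function names are hypothetical, the matrix entries are transcribed from the three-equation system in θ and R, and the coefficient with subscript 020 is taken from the independent equation in which it is multiplied by (2 + 2θ) and set equal to 2. A small Gaussian elimination routine is included so that no external library is needed.

```python
# Sketch only: solves the 3x3 equilibrium system for the coefficients with
# subscripts 200, 111 and 022. Names are illustrative, not the author's.

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] in floating point.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        partial = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - partial) / m[r][r]
    return x


def first_system(theta, big_r):
    """Return the equilibrium values for subscripts 200, 111 and 022."""
    phi_020 = 2.0 / (2.0 + 2.0 * theta)  # from the first independent equation
    a = [
        [2 + 4 * theta + 2 * big_r, -2 * big_r, 0],
        [-2, 6 + 4 * theta + big_r, -big_r],
        [0, -8, 12 + 4 * theta],
    ]
    b = [2.0, 4.0 * phi_020, 4.0 * phi_020]
    return solve_linear(a, b)


no_mutation = first_system(theta=0.0, big_r=1.5)
no_recombination = first_system(theta=0.5, big_r=0.0)
```

Two limiting cases give a quick check on the transcription. With θ = 0 all three values equal 1 for any R, and with R = 0 the first value reduces to 1/(1 + 2θ).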

[Four further systems of equations (pages 145 and 146 of the original) are illegible in this copy.]



Appendix  7 


Taylor's Series Approximation to $r^2$:
An Example

Let $D = f_{11} - pq$


                          Population
                          #1 - #10          #11               #12               Average

f11                       0.9998            0.9700            0.9700            -
f12                       0.0001            0.0100            0.0100            -
f21                       0.0001            0.0200            0.0100            -
f22                       0.0000            0.0000            0.0100            -

D^2                       1.0000 x 10^-16   4.0000 x 10^-8    9.2160 x 10^-5    7.6833 x 10^-6
p(1-p)q(1-q)              9.9980 x 10^-9    1.9404 x 10^-4    3.8416 x 10^-4    4.8192 x 10^-5
D^2 p(1-p)q(1-q)          9.9980 x 10^-25   7.7616 x 10^-12   3.5404 x 10^-8    2.9510 x 10^-9
p^2(1-p)^2 q^2(1-q)^2     9.9960 x 10^-17   3.7652 x 10^-8    1.4758 x 10^-7    1.5436 x 10^-8
D^2 / p(1-p)q(1-q)        1.0002 x 10^-8    2.0614 x 10^-4    2.3990 x 10^-1    2.0009 x 10^-2



Using the averages as expected values, the Taylor's series
approximations to $r^2$ are

$$
r^2 = E\left[\frac{D^2}{p(1-p)q(1-q)}\right] = 0.020009
$$

$$
r^2 \approx \frac{E[D^2]}{E[p(1-p)q(1-q)]} = 0.15943
$$

$$
r^2 \approx \frac{E[D^2]}{E[p(1-p)q(1-q)]}
\left[ 1 - \frac{E[D^2\,p(1-p)q(1-q)]}{E[D^2]\,E[p(1-p)q(1-q)]}
+ \frac{E[p^2(1-p)^2q^2(1-q)^2]}{E^2[p(1-p)q(1-q)]} \right]
= -0.051557
$$
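The arithmetic of this example can be reproduced with a few lines of Python. The sketch below is an illustration rather than the author's own computation. It assumes that p = f11 + f12 and q = f11 + f21 (the marginal frequencies implied by the tabulated values) and that the average is taken over twelve populations, ten of which share the frequencies in the first column. The variable names are hypothetical.

```python
# Sketch only: recomputes the averages and the three values of r^2.
# Assumes p = f11 + f12 and q = f11 + f21, and twelve populations in total.

def d2_and_x(f11, f12, f21, f22):
    """Return D^2 and p(1-p)q(1-q) for one population."""
    p = f11 + f12
    q = f11 + f21
    d = f11 - p * q
    return d * d, p * (1 - p) * q * (1 - q)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


populations = (
    [(0.9998, 0.0001, 0.0001, 0.0000)] * 10   # populations 1 to 10
    + [(0.9700, 0.0100, 0.0200, 0.0000)]      # population 11
    + [(0.9700, 0.0100, 0.0100, 0.0100)]      # population 12
)

pairs = [d2_and_x(*freqs) for freqs in populations]

e_d2 = mean(d2 for d2, x in pairs)        # E[D^2]
e_x = mean(x for d2, x in pairs)          # E[p(1-p)q(1-q)]
e_d2x = mean(d2 * x for d2, x in pairs)   # E[D^2 p(1-p)q(1-q)]
e_x2 = mean(x * x for d2, x in pairs)     # E[p^2(1-p)^2 q^2(1-q)^2]

exact = mean(d2 / x for d2, x in pairs)   # E[D^2 / p(1-p)q(1-q)]
first_order = e_d2 / e_x                  # ratio of expectations
second_order = first_order * (1 - e_d2x / (e_d2 * e_x) + e_x2 / e_x ** 2)

print(f"{exact:.6f} {first_order:.5f} {second_order:.6f}")
```

In this example the ratio of expectations overstates the average ratio by a factor of about eight, and adding the next terms of the series gives a negative value.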
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